*Submitted by Assigned_Reviewer_1*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes using Hessian-free (HF) optimization for learning an MDRNN with a CTC loss on top. The contribution of this work is to show that a non-convex loss (CTC) can still be used within Hessian-free optimization by decomposing the loss using tricks similar to those for Gauss-Newton matrices (which are already applied in HF methods for NNs). The authors show an elegant connection between their effort to make CTC suitable for HF and an EM procedure.
The authors of this paper showed an effective way to train a very complex model (MDRNN) with a non-convex cost (CTC) using second-order information (HF). The main theoretical contribution is to make CTC compatible with HF by proposing a GGN approximation that is PSD, a requirement for CG within HF. Their contribution goes beyond that: they show interesting links with EM, and provide empirical evidence that the method works.
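The PSD requirement mentioned here can be illustrated with a minimal numeric sketch. This is not the paper's actual construction; all matrices below are random stand-ins. It shows why a generalized Gauss-Newton matrix G = JᵀHJ is PSD whenever the loss Hessian H with respect to the network outputs is PSD, which is what makes CG applicable inside HF:

```python
import numpy as np

rng = np.random.default_rng(0)

# J: Jacobian of network outputs w.r.t. parameters (random stand-in).
J = rng.standard_normal((5, 8))

# H: Hessian of a convex loss w.r.t. the outputs; built as A^T A so it is PSD.
A = rng.standard_normal((5, 5))
H = A.T @ A

# Generalized Gauss-Newton matrix.
G = J.T @ H @ J

# v^T G v = (Jv)^T H (Jv) >= 0 for any v, so G is PSD and CG is applicable.
v = rng.standard_normal(8)
print(v @ G @ v >= 0.0)
```

The same argument fails for the exact Hessian of a non-convex loss such as CTC, which is why a PSD surrogate is needed in the first place.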
I, however, have a few concerns regarding the experiments presented in this work:
- As the authors discuss, TIMIT is mostly an overfitting game and not so much an underfitting / optimization game. What was the training cost achieved by both methods?
- There are other baselines that may be more practical (i.e., with fewer implementation details that may go wrong), such as Adagrad, RMSProp, or Adam. Could the authors provide such baselines?
- It would be useful to add some prior art in the tables for comparison.
As a final note, there are quite a lot of grammatical and minor mistakes in the text. I'd suggest a deep pass over those before publishing this work.
**Q2**: Please summarize your review in 1-2 sentences
Despite some problems regarding grammar, this paper shows large benefits from training a very complex model (MDRNN) with a quite complicated loss (CTC) using Hessian-free optimization. The authors handle some technical difficulties (the non-convexity of CTC) quite elegantly, and show rather strong evidence that their framework works on two different tasks.
*Submitted by Assigned_Reviewer_2*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a Hessian-free optimization method to train deep Multi-Dimensional RNNs in the context of connectionist temporal classification, using a local quadratic approximation of the objective function. The approximation is computed using low-rank updates while maintaining positive semi-definiteness. Experimental comparisons are provided on an Arabic handwriting recognition benchmark and on the TIMIT phoneme recognition corpus. Significant quality improvements are observed for the former.
The technical content of this paper is sound and the quality of writing is high. The paper discusses relations with the Fisher information matrix and with expectation maximization, but the most straightforward L-BFGS baseline is not discussed. It would have been helpful if the paper had provided at least experimental comparisons with an L-BFGS baseline, not just with SGD. Another problem is that the networks studied in the paper have a relatively low parameter count. It is unclear how the quality improvements and runtime scale for networks with higher parameter counts.
**Q2**: Please summarize your review in 1-2 sentences
The paper proposes a Hessian-free optimization method to train deep multi-dimensional RNNs in the context of connectionist temporal classification, using a local quadratic approximation of the objective function. This is interesting work, but important practical and theoretical comparisons with other optimization methods like L-BFGS are missing.
*Submitted by Assigned_Reviewer_3*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper discusses how the authors applied Hessian-free optimization to MDRNNs.
For me the deal-breaker was that the experimental comparison was very insufficient.
The authors decided to train the baseline SGD to convergence (1000 iterations) even though that almost never gives the best results.
They say "Note that in order to guarantee the convergence, we selected a conservative criteria compared to the reference".
I can think of no other reason to do this other than to artificially handicap the baseline and to have some good-sounding results to report.
Also there is no discussion of relative speed of the two methods, no convergence plots, and no objective functions reported, so the reader really has no idea what is going on.
**Q2**: Please summarize your review in 1-2 sentences
The idea of applying Hessian-free optimization to this type of network potentially makes sense, but there is weakness in the experiments.
*Submitted by Assigned_Reviewer_4*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
A possible drawback is that the paper focuses a lot on details of the original work [11], and it is not that easy to judge the novelty of the authors' contributions related to [11].
**Q2**: Please summarize your review in 1-2 sentences
Light review of Paper #570: "Hessian-Free Optimization For Learning Deep Multidimensional Recurrent Neural Networks"
The paper presents a sound idea to improve Hessian-free optimization for multi-dimensional recurrent neural networks combined with connectionist temporal classification for sequence labeling tasks. Overall, it is a clearly written paper with a convincing evaluation.
*Submitted by Assigned_Reviewer_5*
**Q1**: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Table 1 looks convincing. But vanilla SGD is of course only one of many options which could be applied to this kind of optimization. For example, the sum-of-functions optimizer of Sohl-Dickstein et al. (2014) also uses second-order information and could presumably be applied without modification.
Regarding Table 2: As you mention, one explanation of the poor performance of HF might be overfitting. This could easily be checked by looking at the performance on the training data, so why not include it in the results?
The definition of S (page 4) is not clear to me: why does B map to a set and not a label sequence?
**Q2**: Please summarize your review in 1-2 sentences
The paper's main contribution seems to be to make Hessian-free optimization of the CTC objective practical for MDRNNs. Based on a coarse reading, it is difficult for me to tell (a) what makes MDRNNs particularly difficult and why we couldn't just use the algorithm of Martens & Sutskever (2011), and (b) which aspects of Section 3 are new and which aspects are just a review of previous results.
**Q1**: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers (referred to as R) for their time and effort, and appreciate their constructive comments. Below, we attempt to address their comments to the best of our ability.
R1: 1) "... TIMIT is mostly an overfitting game and not so much an underfitting / optimization game. What was the training cost achieved by both methods?"

To confirm this, we present the training and validation set label errors for the '15x8x10' network in Table 2. The label errors are reported every second epoch in the case of HF, and every 50 epochs in the case of SGD.
hf_train = [23 23.7 22.5 21.7 20.4 20.1 19.3 19.8 19.2 18.2 17.7 17.2 16.9 16.8 16.3 16.3 16 15.7 15.1 14.9 15 14.5 14.6 14.2 14.2]
hf_val = [26.4 26 25.2 24.8 24.1 23.8 23.5 23.7 23.4 22.9 22.8 22.6 22.6 22.3 22.3 22.4 22.3 22.2 22 21.8 22 21.8 22.1 21.8 22.1]
sgd_train = [26.1 18.6 14.1 12.2 9.9 8.4 7.4 6.2 5.3 4.7 4 3.4 2.8 2.7 2.2 1.8 1.5 1.3 1 1]
sgd_val = [30.4 24.3 22.2 22 21.9 22 21.9 22.2 22.2 22.1 22.4 22.1 22.4 22.4 22.4 22.3 22.5 22.7 22.4 22.6]
With weight distortion [24], the validation error does not decrease after some epochs. However, the training error keeps decreasing as learning continues for both HF and SGD. Please note that HF was stopped using the stopping criterion on the validation set.
2) "... other baselines that may be more practical ... Adagrad, RMSProp, or Adam. ... useful to add some prior art in the tables for comparison."

It would have been helpful to provide a comprehensive baseline experiment. However, please note that, as mentioned in [1,10], the performance of various sophisticated gradient methods, including RProp, CG, and L-BFGS, on training RNNs was not expected to be superior to the online SGD method, especially with the gradient clipping technique used in our paper [25, Pascanu (2013)]. For RNNs, the performance of the HF method is expected to be superior to other methods since it explicitly handles the vanishing and exploding gradient problems [10]. We will add some prior art for easier comparison.
R2: 1) "... experimental comparisons with the L-BFGS baseline ..."

R1 raises the same concern, which we have tried to address in our response above. Kindly refer to that response.
R3: 1) "The authors decided to train the baseline SGD to convergence (1000 iterations) even though that almost never gives the best results. They say "Note that in order to guarantee the convergence, we selected a conservative criteria compared to the reference". I can think of no other reason to do this other than to artificially handicap the baseline and to have some good-sounding results to report."

We respectfully clarify the misunderstanding about the experiments. The final performance was evaluated using the weight parameters that gave the best result on the validation set within 1000 iterations. We will revise the text to make this clearer.
2) "... no discussion of relative speed of the two methods, no convergence plots, and no objective functions reported ..."

The original CTC objective was used as the cost function throughout the paper. Regarding the convergence plots, please refer to our response to R1.

Regarding the relative speed, at least 30 CG loops were performed per iteration of HF, and one CG loop requires 1.2~1.5 times the computation of one iteration of SGD. Considering that we performed a brute-force search over the learning rate and momentum parameters for SGD, a similar amount of computational resources was spent on both methods.
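As a back-of-envelope check of the figures above (the per-loop cost range is from the rebuttal; the resulting totals are illustrative, not reported measurements), one HF iteration costs roughly 36 to 45 SGD-iteration equivalents:

```python
# Cost of one HF iteration, in SGD-iteration equivalents.
cg_loops_per_hf_iter = 30                      # at least 30 CG loops per HF iteration
cg_cost_lo, cg_cost_hi = 1.2, 1.5              # one CG loop, relative to one SGD iteration

lo = cg_loops_per_hf_iter * cg_cost_lo         # roughly 36 SGD-equivalents
hi = cg_loops_per_hf_iter * cg_cost_hi         # roughly 45 SGD-equivalents
print(lo, hi)
```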
R4: We appreciate your encouraging comments.
R5: 1) "... (even if extremely toy) how exactly the convex approximation of CTC influences things ..."

We appreciate this insightful comment. A simple but nontrivial case is a network of 2 output neurons with a sequence of length 2. Analysis of this simple network will show how the convex approximation affects the geometry of optimization and its convergence to the exact solution. These results will be included in the Appendix.
2) "... any benchmark comparisons with other published methods ..."

Please refer to our response to R1 above.
R6: 1) "... what makes MDRNNs particularly difficult ... which aspects of Section 3 are new ..."

Section 3 explains the key motivation of this paper and discusses why a convex approximation of the CTC loss function is necessary to apply HF to MDRNNs. We will revise the text to make this clear.
2) "... the sum-of-functions optimizer ..."

We find the sum-of-functions optimizer quite interesting and would like to test its performance in the future. For other options, please refer to the response to R1 above.
3) "... by looking at the performance on the training data ..."

Please refer to our response to R1 for the actual numbers.
4) "... why does B map to a set and not a label sequence ..."

B maps to a label sequence. B is a surjective mapping, and S is the image of B.
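For readers unfamiliar with the notation, B is the standard CTC collapsing map that removes repeated symbols and then blanks, so many framewise paths map to the same label sequence. A minimal sketch (the function name and blank symbol are ours, not from the paper):

```python
def ctc_collapse(path, blank='-'):
    """CTC map B: drop consecutive repeats, then blanks, giving a label sequence."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return ''.join(out)

# Several distinct paths collapse to the same label sequence, so B is
# surjective onto its image S (the set of reachable label sequences).
print(ctc_collapse('aa-ab-'))   # 'aab'
print(ctc_collapse('a-a-b-'))   # 'aab'
```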