Paper ID: 410
Title: Attention-Based Models for Speech Recognition
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper introduces an attention-based recurrent sequence generator for phone recognition. The main contributions are a different normalisation method for the attention vector (which works better for this task) and a "hybrid" attention method that takes the preceding attention vector into account (which also improves results on this task).
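
For concreteness, here is a minimal sketch of the kind of normalisation change at stake. The sigmoid-based variant below is an assumption about the paper's "smoothing" scheme, meant only to illustrate the idea:

```python
import numpy as np

def softmax_attention(scores):
    # Standard normalisation: alpha_j = exp(e_j) / sum_k exp(e_k).
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

def smoothed_attention(scores):
    # Assumed smoothing variant: alpha_j = sigmoid(e_j) / sum_k sigmoid(e_k),
    # which yields less peaked alignments than the exponential softmax.
    s = 1.0 / (1.0 + np.exp(-scores))
    return s / s.sum()
```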

The paper is well written and clear. It has an adequate 'related work' section and a sufficiently thorough experimental section that supports the claims stated at the beginning of the paper.

However, the novelty of the presented method is very limited: it amounts to two small modifications of the attention mechanism used in neural machine translation.

The paper makes no mention of the computational requirements of the presented method, which are of great importance to the speech recognition community. A method that has access to the whole input sequence at every output step is necessary for machine translation because of the different syntactic rules of different languages. For phone recognition, however, it is probably overkill, as coarticulation and related effects have only a local influence in time. It would be interesting to see this method adapted to a rolling-window approach, which would probably be much more computationally efficient.
Q2: Please summarize your review in 1-2 sentences
The paper is sound but the novelty limited.

I recommend acceptance but the paper might find a more interested audience in ICASSP or Interspeech.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper applies a recently proposed attention-based model from machine translation to speech recognition.

My biggest criticism concerns the experimental section, which uses the TIMIT dataset. TIMIT is not the best choice for an ASR task, for several reasons. First, it is (almost) perfectly phonetically balanced (very unusual for real data) and was originally designed for linguistic/dialect studies rather than anything connected with ASR. Second, it consists of almost perfectly clean read speech without any real mismatch between training and testing conditions. Third, you cannot really draw strong conclusions based on it. As a result, there is a set of ASR techniques that work on TIMIT (and on nothing else) and a set of techniques that work on everything else (but not on TIMIT; sequence-discriminative training, for example). Since your paper explicitly tries to address the ASR problem, I am not sure whether your approach is actually better than the other CTC approaches proposed to date or whether it is just yet another TIMIT artefact.

If you tried your model on something more challenging (even something like WSJ, as in Alex Graves' ICML paper), I would be entirely positive about this work. Reporting on TIMIT is fine, provided the findings are further strengthened on a more challenging ASR benchmark.
Q2: Please summarize your review in 1-2 sentences
An interesting paper working towards a promising CTC-style acoustic model, but not quite there yet.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper applies attention-based models to speech recognition. The model is extended with location awareness, sharpening and smoothing of the scores, and windowing. These extensions overcome the long-utterance issues of attention-based models and achieve an improvement over a straightforward application of attention-based models.

Quality: The paper describes an original attention-based model and mathematically explains the extension to a location-aware score function in Eqs. (5) - (9). The other extensions (sharpening and smoothing) are also mathematically described with the long-utterance issues in mind. The paper therefore has sufficient quality in terms of mathematical/theoretical formulation.

Clarity: The paper clearly describes the attention-based model, the issues with long sentences, and the analysis of the experimental results.

Originality: Although the novelty of this paper is incremental, each step of the incremental extensions is reasonably supported by the experimental results and analysis.

Significance: Since the task is relatively small (TIMIT phone recognition), the experimental result is not so significant. Moreover, although the proposed method uses a windowing technique, computing \alpha requires long-range computations during training and decoding, and the method does not seem scalable to large-scale (practical) speech recognition tasks.

I summarize the overall pros and cons as follows. Pros: a novel attention-based architecture that provides end-to-end speech recognition. Cons: scalability.

Minor comment: 1. P.2, Section 2: L and T must be explicitly defined.
Q2: Please summarize your review in 1-2 sentences
A novel attention-based model with location awareness is applied to speech recognition. By carefully analyzing an issue with long utterances, the extended model obtains a further gain over a conventional attention-based model.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We are very grateful to the reviewers for their time and for the comments on our submission.

First, we would like to clarify the position of this work within machine learning research. Prior to this work, the literature on attention-based RNNs lacked an analysis of the approach's applicability to long input sequences, and most importantly to sequences longer than those seen during training. Our work fills this important gap: it analyzes the drawbacks of the existing approaches and proposes an effective and generic solution. We therefore believe that this work belongs at a general machine learning conference such as NIPS.

On the other hand, our experiments delivered a novel speech recognition architecture that performs on par with existing approaches.

Below we respond directly to the criticisms of our work, grouped by aspect:

1. Novelty
We do combine ideas from A. Graves' work on sequence generation, from Neural Turing Machines (NTM), and from the work of D. Bahdanau et al. on machine translation. However, we introduce novel convolutional features that are arguably better than previous location-awareness proposals: they are fully trainable, straightforward to extend to many dimensions, and easy to implement. An attention mechanism with convolutional features does not involve predicting the shift to the next location, an approach we argue against in Subsection 2.1. The convolutional features are also in a sense deep, as opposed to the linear interpolation of content- and location-based components in the NTM.
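
For concreteness, a simplified sketch of such a location-aware scoring step. Names and shapes (W, V, U, w, F) are illustrative assumptions following the description above, not the actual implementation:

```python
import numpy as np

def location_aware_scores(s_i, h, alpha_prev, W, V, U, w, F):
    """Unnormalised scores e_{i,j} = w^T tanh(W s_i + V h_j + U f_j), where
    the location features f = F * alpha_{i-1} are 1-d convolutions of the
    previous alignment with trainable filters F."""
    # One convolution of alpha_prev per filter row of F -> (L, n_filters).
    f = np.stack([np.convolve(alpha_prev, F[m], mode='same')
                  for m in range(F.shape[0])], axis=1)
    # Combine decoder state (s_i), encoded inputs (h), and location features.
    act = np.tanh(s_i @ W.T + h @ V.T + f @ U.T)  # (L, n_hidden)
    return act @ w                                 # (L,) one score per frame
```

The scores are then normalised into the alignment \alpha_i; because the filters F are learned, no hand-designed shift prediction is needed.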

We also believe that the investigation of the failure modes of both the location-aware and location-unaware attention mechanisms, along with the resulting notion of alignment sharpening, is a crucial contribution. The content-based attention mechanism allows more varied alignments than those produced by other techniques, since the alignments need not be monotonic; this brings its own failure modes. We demonstrate that the location-unaware model learns to track its position implicitly but not robustly, which makes it fail in a completely different way from the location-aware model. Knowing how to apply models trained on short sequences to longer ones is important for other users of attention-based models. This need seems to be confirmed, e.g., by a recent contribution from Google - http://arxiv.org/abs/1508.01211v1 - which reports decoding failures on long utterances without proposing any solution.
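
One common way to realize the sharpening mentioned above is an inverse temperature in the normalisation; the exact variant is an assumption here, but the effect is the same, concentrating the alignment on the highest-scoring frames:

```python
import numpy as np

def sharpened_attention(scores, beta=2.0):
    # alpha_j proportional to exp(beta * e_j); beta > 1 sharpens the
    # alignment, beta = 1 recovers the standard softmax.
    e = np.exp(beta * (scores - scores.max()))
    return e / e.sum()
```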

2. Value of experiments conducted on the TIMIT dataset
We chose TIMIT because it is popular, small, and well known. Our main concern was how the attention mechanism would perform when asked to align long sequences and whether it would generalize to even longer ones. The speech-recognition-specific problems with TIMIT, such as low noise or a balanced phoneme set, matter less when the main goal is to provide a difficult benchmark for attention-based RNNs.

That being said, we are currently working with the WSJ and LibriSpeech corpora, which are substantially larger. We are also working on reducing the computational complexity to make the approach more practical; e.g., we apply the windowing from Subsection 2.3 during both training and testing. This model has a computational complexity that scales linearly with the length of the output sequence (as opposed to the one used for the TIMIT experiments, whose complexity scales with the product of the input and output sequence lengths, i.e., roughly quadratically with the length of the output sequence). On WSJ, when trained directly on characters, we achieve a low 7.8% character error rate on the test_eval92 split with no language model used during decoding; when decoding with the standard bigram WSJ language model, the model reaches the performance of CTC models of about 15% word error rate. Still, no initial alignments are needed and no conventional ASR system is required.
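
A hedged sketch of why windowing gives the linear scaling claimed above: each output step scores only a fixed-size window rather than all L frames. The placement heuristic used here (centring on the previous alignment's mean position) is an assumption; the exact rule in Subsection 2.3 may differ:

```python
import numpy as np

def windowed_attention(score_fn, h, alpha_prev, half_width=75):
    """Attend only to a window around the previous alignment's centre of
    mass, so each output step costs O(window) rather than O(L)."""
    L = len(h)
    centre = int(round(float(np.arange(L) @ alpha_prev)))
    lo, hi = max(0, centre - half_width), min(L, centre + half_width)
    scores = score_fn(h[lo:hi])          # score only the windowed frames
    e = np.exp(scores - scores.max())
    alpha = np.zeros(L)                  # zero weight outside the window
    alpha[lo:hi] = e / e.sum()
    return alpha
```

With a fixed window, the total cost is O(window x output length) instead of O(input length x output length).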

The attention-based model also offers many new possibilities for speech processing. For instance, it discards the typical requirement that each phoneme span at least three speech frames. This permits pooling over time, and we have indeed succeeded in reducing the encoded recording length by a factor of up to 16 with very similar decoding performance.
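
A toy illustration of the kind of temporal pooling referred to here; mean pooling and the factor are assumptions, the actual encoder may pool differently:

```python
import numpy as np

def pool_over_time(h, factor=4):
    # Average groups of `factor` consecutive frames, shortening the
    # sequence the attention mechanism must scan by that factor.
    L, d = h.shape
    L_trim = (L // factor) * factor      # drop any trailing partial group
    return h[:L_trim].reshape(-1, factor, d).mean(axis=1)
```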