NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID 1888: A Regularized Framework for Sparse and Structured Neural Attention

### Reviewer 1

The authors investigate different mechanisms for attention in neural networks. Classical techniques are based on the softmax function, where all elements in the input always make at least a small contribution to the decision. To address this, sparsemax [29] was recently shown to work well: it focuses on a subset of the domain, treating the remaining part as ‘not contributing’. Here, the authors generalize this concept by applying a general smoothing technique based on conjugate functions of the max operator. The authors then show how to compute the derivatives required for training and how to employ structured regularization variants. The approach is validated on textual entailment (Stanford Natural Language Inference dataset), machine translation, and sentence summarization (DUC 2004 and Gigaword datasets). The proposed regularization terms are explored on those tasks, and advantages are demonstrated compared to the classical softmax.

Strengths: The paper investigates interesting techniques and convincingly evaluates them on a variety of tasks.

Weaknesses: The authors argue that the proposed techniques often lead to ‘more interpretable attention alignments’. I wonder whether this is quantifiable. In computer-vision models, attention mechanisms do not necessarily correlate well with human attention; can the authors comment on these effects for NLP? Although mentioned in the paper, the authors do not explicitly discuss the computational implications of the more complex regularization mechanisms. I think a dedicated section could provide additional insights for the reader.
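For reference, the conjugate-smoothing construction this summary alludes to can be written as follows (my paraphrase of the standard setup, where $\Delta^d$ denotes the probability simplex and $\gamma > 0$ a smoothing strength):

```latex
\max\nolimits_{\Omega}(\mathbf{x})
  = \max_{\mathbf{p} \in \Delta^d} \langle \mathbf{p}, \mathbf{x} \rangle - \gamma\,\Omega(\mathbf{p}),
\qquad
\nabla \max\nolimits_{\Omega}(\mathbf{x})
  = \operatorname*{arg\,max}_{\mathbf{p} \in \Delta^d} \langle \mathbf{p}, \mathbf{x} \rangle - \gamma\,\Omega(\mathbf{p}).
```

The gradient of the smoothed max is itself the attention mapping: taking $\Omega$ to be the negative entropy recovers the softmax, taking $\Omega(\mathbf{p}) = \tfrac{1}{2}\|\mathbf{p}\|_2^2$ recovers sparsemax, and structured penalties yield the structured variants.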

### Reviewer 2

This paper presents a number of sparsifying alternatives to the softmax operator that can be straightforwardly integrated into the backpropagation of neural networks. These techniques can be used in general to approximate categorical inference in neural networks, and in particular to implement attention. The paper extends work such as sparsemax from its reference [29]. I believe this is a strong paper with significant contributions and thorough justifications. The paper presents:

1. The gradient of a general regularization of the max operator (Section 2.2).
2. Softmax and sparsemax as examples (Section 2.3).
3. An original derivation of the Jacobian of any differentiable regularizer, and an example with the squared p-norm (Section 3.1).
4. Two examples with structured regularizers, with algorithms for their computation and an original derivation of their Jacobians (Section 3.2). One of them (fusedmax) is a novel use of the fused lasso [39] as an attention mechanism.

One may find a limitation of this paper in the fact that the experiments are carried out using networks that are no longer state of the art. Accordingly, the paper does not claim any improvements over state-of-the-art accuracies. However, in my opinion this is not important, since the aim is to replace the softmax operator in a variety of plausible architectures. The related work section is very clear and current, including a reference ([21]) on structured attention from ICLR 2017 that had immediately come to my mind. The structure discussed in that reference is more general; however, the contributions are very different. The paper also contains rich and descriptive supplemental material.
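To make the sparsemax example in item 2 concrete, here is a minimal NumPy sketch of its closed-form solution, the Euclidean projection onto the probability simplex (the function name and 1-D interface are my own; this is a sketch, not the paper's implementation):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a 1-D score vector z onto the simplex.

    Sorts the scores, finds the support size rho via the standard
    thresholding condition, and clips everything below the threshold
    tau to exactly zero, producing a sparse probability vector.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # scores in decreasing order
    cssv = np.cumsum(z_sorted) - 1.0     # cumulative sums, shifted by 1
    k = np.arange(1, z.size + 1)
    support = z_sorted - cssv / k > 0    # indices kept in the support
    rho = k[support][-1]                 # size of the support
    tau = cssv[support][-1] / rho        # threshold
    return np.maximum(z - tau, 0.0)
```

Unlike softmax, the output can contain exact zeros, which is what makes the resulting attention weights sparse; on the support, the mapping is affine, so its Jacobian is simple to write down, consistent with the closed forms the paper derives.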