Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
==== Based on the authors' response, I find the comparison against gradient checkpointing they provide satisfactory. Please ensure it is included in the final draft ====

This work considers handling sequences of network layers with identical weights (i.e. weight-tied layers) via fixed-point computation. Instead of directly computing the sequence, a quasi-Newton method is used to approximate the fixed point of the sequence. This has the advantage that the gradient has a simpler form, although one which must also be computed iteratively.

The advantages are:
• Much lower memory usage, as intermediate tensors do not need to be stored for use in the backward pass. Approximately 4-10x lower for the considered models.
• Empirically (sometimes) better perplexity compared to iterating.

The disadvantages are:
• About 3-5x slower to train, 1.7-2x slower at evaluation time.
• Significant additional implementation complexity.
• If there is no fixed point, the method may not work.

The paper is well-written and generally polished. I'm disappointed that the runtime comparisons are relegated to the appendix; it is poor scholarship to hide the disadvantages of your method.

Reducing the memory requirements of language models is an important goal, as the largest SOTA models are typically memory-constrained during training. A practical approach here could have significant impact. The experiments seem reasonable, and the authors compare against current SOTA-level models.

It's unclear whether there is any advantage compared to using gradient checkpointing. Checkpointing is discussed in the related work section but not tested against. Checkpointing requires only one extra forward pass, which is typically a 1.5x overhead, although the memory reduction may not be as significant (depending on the exact network architecture). The computational resources saved could be used to train a slightly larger model, which may reach similar perplexities.
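To make the trade-off concrete, the gradient-checkpointing baseline mentioned above can be sketched as follows. This is a minimal numpy illustration, not the paper's method: the weight-tied layer f(z) = tanh(Wz), the width d, depth L, and checkpoint interval k are all hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, k = 4, 6, 3                      # width, depth, checkpoint interval
W0 = rng.standard_normal((d, d))
W = 0.9 * W0 / np.linalg.norm(W0, 2)   # keep the layer well-behaved

def layer(z):
    return np.tanh(W @ z)              # one weight-tied layer

x = rng.standard_normal(d)

# Forward pass: store only every k-th activation (the "checkpoints"),
# instead of all L intermediate tensors.
ckpt = {0: x}
z = x
for i in range(1, L + 1):
    z = layer(z)
    if i % k == 0:
        ckpt[i] = z
out = z

# Backward pass for the loss l = sum(z_L): within each segment, recompute
# the dropped activations with one extra forward pass, then backprop.
g = np.ones(d)                         # dl/dz_L
for seg_end in range(L, 0, -k):
    seg_start = seg_end - k
    zs = [ckpt[seg_start]]             # recompute z_{seg_start} .. z_{seg_end-1}
    for _ in range(k - 1):
        zs.append(layer(zs[-1]))
    for z_in in reversed(zs):          # backprop through the segment
        a = np.tanh(W @ z_in)
        g = W.T @ ((1 - a ** 2) * g)   # J^T g for one tanh layer
grad_x = g                             # dl/dx, with O(L/k) stored activations
```

The recomputation inside each segment is the "one extra forward pass" that accounts for the roughly 1.5x training overhead cited above, while storage drops from L activations to L/k checkpoints.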
If this paper clearly showed that the approach was superior to checkpointing, it would be a much stronger submission.
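For contrast with checkpointing, the DEQ-style computation this review describes replaces the layer stack with a fixed-point solve, and the backward pass uses the implicit function theorem rather than stored activations. A minimal numpy sketch, where the weight-tied transformation f(z, x) = tanh(Wz + Ux) and all sizes are hypothetical stand-ins for the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W0 = rng.standard_normal((d, d))
W = 0.5 * W0 / np.linalg.norm(W0, 2)   # spectral norm 0.5, so f is contractive
U = rng.standard_normal((d, d))
x = rng.standard_normal(d)

def f(z):
    return np.tanh(W @ z + U @ x)      # hypothetical weight-tied transformation

# Forward: iterate to the fixed point z* = f(z*); no intermediate
# activations are kept, only the equilibrium itself.
z = np.zeros(d)
for _ in range(200):
    z = f(z)

# Backward via the implicit function theorem: for a loss l(z*),
#   dl/dx = (df/dx)^T (I - df/dz)^{-T} dl/dz*,
# which needs only z*, not the iterates that led to it.
a = np.tanh(W @ z + U @ x)             # = z* at the fixed point
D = 1 - a ** 2                         # tanh'(.) on the diagonal
Jz = D[:, None] * W                    # df/dz at z*
g = np.ones(d)                         # dl/dz* for the loss l = sum(z*)
u = np.linalg.solve((np.eye(d) - Jz).T, g)
grad_x = U.T @ (D * u)                 # dl/dx
```

This is where the memory advantage comes from: the backward pass solves one linear system at the equilibrium instead of traversing stored intermediates, at the cost of the extra solver iterations noted in the runtime figures.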
I have read the rebuttal and the other reviews and maintain my score. I look forward to discussion of the (apparently since improved) training time in the main text.

======== The submission concisely presents a simple but powerful idea that will have impact. The DEQ model is a sensible model that draws on neural-ODE-type ideas but stands on its own. (To my knowledge, this is the first application of implicit-depth ideas to sequence modeling.) The derived algorithm is sensible, clearly explained, and the proofs appear correct (I checked them). The experiments seem reasonable and rigorous. The paper is well-written and exceptionally easy to follow. Overall, I strongly recommend accepting this paper.

A few questions:
- In the paper, the authors state (and empirically show in the Appendix) that they can get away with a large \eps tolerance at inference time. How did they decide on the \eps = 1e-6, 1e-8 choices for training? (One would expect this to affect the quality of the learned model.)
- How were the models initialized? I would expect this to have substantial influence on the stability of training, since it may affect e.g. whether f is non-expansive. (From the code it looks like the initialization is pretty standard, but this should be discussed in the main text.)
- The current version of the paper defers all discussion of runtime to the Appendix. Since the (slightly) slower runtime is a drawback of the model, it would be fairer to state this in the main text.
I have read the author response and am satisfied with it. As the other reviewers have pointed out, it would be good to move the discussion of training time to the main text.

-----------------------------------------------------------------------------------------------------------------------------

This paper proposes a new approach to modeling sequential data called Deep Equilibrium Models (DEQ), whereby instead of running a weight-tied deep neural network layer by layer, its fixed point is directly computed. Results show that this approach gives the same perplexity as the deep neural network but has a much smaller memory footprint. However, the inference and training cost can be 2-4x higher.

The paper is written and organized nicely and is easy to follow. The idea itself seems novel and addresses the important problem of reducing the memory footprint of deep neural networks. However, I am not sure what happens when the fixed point is unstable. Nevertheless, the authors have done extensive experimentation on both synthetic and real-world datasets and have shown that DEQ reduces the memory footprint by more than 80% while giving the same perplexity.
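The quasi-Newton fixed-point solve that the reviews refer to can be illustrated with Broyden's method applied to the root-finding problem g(z) = f(z) - z = 0. The sketch below is a minimal numpy illustration under an assumed contractive map f(z) = tanh(Wz + b), not the authors' implementation; the tolerance and iteration budget are arbitrary example values.

```python
import numpy as np

def broyden_solve(f, z0, tol=1e-8, max_iter=100):
    """Find z with f(z) = z via Broyden's method on g(z) = f(z) - z."""
    z = z0
    B = -np.eye(len(z0))               # approx. inverse Jacobian of g;
                                       # -I makes the first step plain iteration
    gz = f(z) - z
    for _ in range(max_iter):
        if np.linalg.norm(gz) < tol:
            break
        dz = -B @ gz                   # quasi-Newton step
        z_new = z + dz
        g_new = f(z_new) - z_new
        dg = g_new - gz
        # "good Broyden" rank-one update of the inverse Jacobian
        Bdg = B @ dg
        denom = dz @ Bdg
        if abs(denom) > 1e-12:         # skip degenerate updates
            B = B + np.outer(dz - Bdg, dz @ B) / denom
        z, gz = z_new, g_new
    return z

# Example on a hypothetical contractive map.
rng = np.random.default_rng(0)
d = 4
W0 = rng.standard_normal((d, d))
W = 0.5 * W0 / np.linalg.norm(W0, 2)   # spectral norm 0.5 => contraction
b = rng.standard_normal(d)
z_star = broyden_solve(lambda z: np.tanh(W @ z + b), np.zeros(d))
```

Because each rank-one update reuses previous residuals, the solve typically reaches a tight tolerance in far fewer function evaluations than plain iteration, which is the mechanism behind the tolerance/quality trade-off the reviewers ask about.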