NeurIPS 2020

Adam with Bandit Sampling for Deep Learning

Review 1

Summary and Contributions: This work proposes ADAMBS, an extension of ADAM that uses bandit sampling to select training examples. The resulting convergence rate is significantly better than that of ADAM in its dependence on the number of training examples, and the empirical speed-up is also verified for training CNNs, RNNs, and RCNNs. ---- After reading the response: I appreciate your hard work. I would like to raise the score.

Strengths: The improved dependence of the convergence rate on the number of training examples is remarkable, as is the better empirical performance.

Weaknesses: It would be nice to compare the proposed method with other competitors (SGD, momentum, Adagrad, RMSprop, ...).

Correctness: The claims seem correct.

Clarity: The paper is well written and easy to read.

Relation to Prior Work: The relationship with related studies is well discussed.

Reproducibility: No

Additional Feedback: It would be nice if the authors could conduct additional experiments to verify the effectiveness of the proposed method over existing methods (e.g., SGD, momentum, Adagrad, RMSprop).

Review 2

Summary and Contributions: The authors proposed a generalization of the ADAM optimizer for deep learning by sampling training examples according to their importance to the model's convergence. The sampling is performed using a multi-armed bandit algorithm. The authors showed that the new optimizer, ADAMBS, improves the convergence rate both theoretically and practically on different benchmarks.
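The idea of sampling examples according to their importance can be illustrated with a minimal sketch. It is not the paper's algorithm: the choice of gradient norm as the importance score, the function name `weighted_minibatch_gradient`, and the toy data are all assumptions made for illustration. The sketch shows the standard reweighting by 1/(n p_i) that keeps a non-uniformly sampled minibatch gradient unbiased for the full-data average.

```python
import numpy as np

def weighted_minibatch_gradient(per_example_grads, probs, batch_size, rng):
    """Estimate the mean gradient from examples drawn with probabilities `probs`.

    Each sampled gradient is divided by n * p_i, so the estimate is unbiased
    for the uniform average over all n examples.
    """
    n = per_example_grads.shape[0]
    idx = rng.choice(n, size=batch_size, replace=True, p=probs)
    weights = 1.0 / (n * probs[idx])
    return (weights[:, None] * per_example_grads[idx]).mean(axis=0)

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 3))        # toy per-example gradients (n=100, d=3)
norms = np.linalg.norm(grads, axis=1)
probs = norms / norms.sum()              # hypothetical importance score: gradient norm

# Average many weighted minibatch estimates and compare with the full gradient.
estimate = np.mean(
    [weighted_minibatch_gradient(grads, probs, 10, rng) for _ in range(5000)],
    axis=0,
)
full_gradient = grads.mean(axis=0)
print(np.abs(estimate - full_gradient).max())
```

In this sketch the probabilities are fixed in advance; a bandit method would instead adapt them during training from the feedback observed on the sampled examples.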

Strengths: 1. The authors demonstrated that the proposed optimizer ADAMBS significantly improves both the convergence speed and the optimal value achieved on multiple benchmarks. Given that the computational cost of machine learning tasks is becoming an acute bottleneck as we train ever more complex models on huge amounts of data, this is a significant result. 2. In addition to the practical improvements, the authors also showed that, in a simple setting, adding bandit sampling to Adam can improve the regret with respect to n, the number of training samples, by a factor of O(\sqrt{n/\log n}). This is a significant speedup if the assumptions are satisfied, since n can be very large in practice. 3. The presentation is clear, with good references to prior work, making it easy to identify the novelty and significance of the paper.
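To give a sense of scale for the \sqrt{n/\log n} factor mentioned in point 2, the short sketch below evaluates it for a few dataset sizes. The sizes are arbitrary examples, the natural logarithm is assumed, and the constants hidden by the O(.) notation are ignored, so the numbers indicate growth only and are not a predicted speed-up.

```python
import math

def regret_improvement_factor(n):
    """Return sqrt(n / log n), the growth rate quoted in the review (constants omitted)."""
    return math.sqrt(n / math.log(n))

# Illustrative dataset sizes only; they are not taken from the paper.
for n in (1_000, 60_000, 1_000_000):
    print(n, round(regret_improvement_factor(n), 1))
```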

Weaknesses: 1. The comparison in which the authors show superior convergence of ADAMBS over ADAM is limited to a specific setting and assumes that the feature vector follows a doubly heavy-tailed distribution. It is unclear how important this assumption is to the result, and it is not clear whether it holds in practice. 2. While there is a significant improvement in terms of n, there is no difference in terms of d, the size of the feature vector. If we are training models with more parameters than training examples, the proposed method would not make a significant difference compared to Adam.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This work proposes ADAMBS, an attempt towards improving convergence of ADAM. The proposed procedure is based upon sampling minibatches during training that can aid in convergence. The sampling distribution at each iteration is updated through a multi-armed bandit procedure. This work provides both theoretical analysis and empirical explorations to support their claim.
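A generic way to update a sampling distribution with a multi-armed bandit is an EXP3-style rule, sketched below. This is an illustration under stated assumptions and is not claimed to match the paper's Algorithm 2: the function names, the step size `eta`, the toy reward vector, and the use of `p_min` as a uniform floor on the probabilities are all hypothetical choices.

```python
import numpy as np

def exp3_update(log_weights, idx, rewards, probs, eta):
    """One EXP3-style step: raise the log-weights of the sampled examples.

    Rewards are divided by the sampling probability (importance weighting),
    so rarely sampled examples are not systematically under-credited.
    """
    new_log_weights = log_weights.copy()
    # np.add.at accumulates correctly when an index appears more than once.
    np.add.at(new_log_weights, idx, eta * rewards / probs[idx])
    return new_log_weights

def sampling_distribution(log_weights, p_min):
    """Softmax over log-weights, mixed with uniform so every probability is >= p_min.

    Requires n * p_min <= 1 so that the mixture is a valid distribution.
    """
    n = log_weights.shape[0]
    soft = np.exp(log_weights - log_weights.max())  # subtract max for stability
    soft /= soft.sum()
    return (1.0 - n * p_min) * soft + p_min

rng = np.random.default_rng(0)
n = 8
log_w = np.zeros(n)
toy_reward = np.linspace(0.1, 1.0, n)    # stand-in for per-example feedback
for _ in range(200):
    p = sampling_distribution(log_w, p_min=0.01)
    idx = rng.choice(n, size=2, p=p)
    log_w = exp3_update(log_w, idx, toy_reward[idx], p, eta=0.05)
p = sampling_distribution(log_w, p_min=0.01)
print(p)
```

The uniform floor keeps the importance weights 1/p_i bounded, which is one common reason for including such a parameter in bandit samplers.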

Strengths: The claims seem to be theoretically sound. The paper also provides empirical support for the claims. The work does have some novelty. The use of multi-armed bandits for learning sampling distributions at each iteration of ADAM seems useful (and seems novel in the context of ADAM). The combination of ADAM and multi-armed bandits can motivate the broader NeurIPS community to apply, explore, and expand on this approach for gradient-based optimization problems, which are very widely used. Therefore, it does seem relevant to the broader NeurIPS community.

Weaknesses: The paper proposes an approach for sampling minibatches. It does not, however, explore the strength of this approach more broadly. Is there something specific to ADAM that makes the sampling approach more useful? Furthermore, setups such as curriculum learning can perhaps also provide adaptive minibatches during training. It does not seem that such avenues were explored as comparisons to the proposed setup.

Correctness: The claims seem to be correct. Standard, known methodology seems to have been used in the empirical explorations.

Clarity: The paper does seem to be clearly written.

Relation to Prior Work: This work discusses prior work in the area and tries to show how it differs from it.

Reproducibility: Yes

Additional Feedback: This work seems to propose an approach for sampling minibatches that can perhaps be applied to procedures other than ADAM. Was this approach (or suitable variants) explored, perhaps empirically, for other optimization procedures that involve minibatches? Or is the proposed sampling approach suitable only for ADAM-like procedures? What about curriculum learning? It can also be used to produce desired minibatches for better training. How does this approach compare to the state of the art in curriculum learning? In Algorithm 2, Line 2, what is L? What is the influence of L and p_min in Algorithm 2? ADAMBS seems to converge to a lower loss value in most of the presented empirical explorations. Is that a typical trend or just an observation for the given data sets and architectures? If it is a common trend, was the influence of the sampling procedure on this phenomenon explored? Perhaps minor: in line 99, f is dependent on the data. Therefore it is a function of more than just \theta. ---- After rebuttal: Thanks for the clarifications.