NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7742 Sampled Softmax with Random Fourier Features

Reviewer 1

The presented method is an extension, using Random Fourier Features instead of the approximation used in [12], for kernel-based sampling on the evaluation of the gradients for SoftMax layer. It is not clear, and it is not studied at all, how the assumption of bounded gradients affects the empirical evidence. Same as above for the normalization of the class embeddings and input embeddings. Moreover, it is argued by the reviewer that such a process does not correspond to a widely SoftMax use. ----- After reading the rebuttal by the authors, I believe that this submission is stronger, however comments 1 and 2 are not addressed. I tend to keep my scores.

Reviewer 2

Extreme multiclass classification is an important and challenging problem: the naive approach to training involves calculating the (gradient of the) softmax loss function, which scales as the number of classes. One approach to make training more efficient is sampled softmax; however, bias is introduced from the sampling distribution differing from the softmax distribution. The authors propose a sampling method with reduced bias, based on random Fourier features (RF-softmax). They prove a bound for the bias of a sampling method in terms of the multiplicative factor difference from the softmax distribution, and bound the multiplicative factor difference for RF-softmax. They show an improvement in efficiency and accuracy over other sampling methods, on benchmark datasets such as Penn Treebank and Delicious-200k. The result is of practical importance, and the writing and organization are clear. The comparisons are thorough and well-documented: the authors document the effects of varying the dimension $D$ and temperature parameters $\nu$ and $\tau$. Suggestions and questions (in decreasing order of importance): + What do you get by applying Theorem 1 to RF-softmax? It would be nice to show as a corollary to Theorem 1 and 2 what the bounds are for RF-softmax. Are the bounds purely of theoretical interest, or do they give some guidance in selecting the parameters? + Theorem 2 and Remark 2: Where is the dependence on $D$ coming from in $o_D(1)$? I don't see a dependence on $D$ in (17). (Mention how $\gamma_2$ depends on $\gamma_1$.) + What happens if the inner product $\phi(c_i)^T\phi(h)$ is negative? + How does RF-softmax compare to hierarchical softmax? It appears that hierarchical softmax does not suffer from the shortcomings of negative sampling, as one can compute an unbiased gradient in $O(d \log n)$ time. What are the advantages of RF-softmax over hierarchical softmax? + $q_j$ should provide a tight uniform multiplicative approximation of $e^{o_j}$." How tight is the bound/how confident are you that this is the right bound? + It would be good to show the calculation that verifies (15). Minor edits: + (line 48) iterations -> iteration + (50) [brace] -> [parenthesis] + (55) compare -> compared + (130) based *on* + (144) multiplication -> multiplicative + (397) definition of $S_l$ should have indices $s_j$, not $j$. --- Reply: Thanks to the authors for addressing the points raised by the reviewers. I hope that the answers to the questions will be made clear, and that the theorem statement for RFF will be included in the final version of the paper. I continue to recommend the paper for acceptance.