NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 6978 Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements

Reviewer 1

This submission proposes a new technique for the Top-K frequent elements (a.k.a. "heavy hitters") problem, with a particular focus on practical implementation concerns: approximation accuracy, update speed, and suitability for both single-machine and distributed parallelism. The crux of the idea is to embed the "Frequent" algorithm as a subroutine into the Count-Min Sketch (CMS) approach, avoiding the update expense and inaccuracy of the auxiliary "candidate set" heap data structure. The theoretical results show good approximation guarantees, and the experimental results show good accuracy and performance in practice.

The paper is clearly written, the experimental setup is thoroughly described, and the technical details seem sound (I read the supplement but did not work through the proofs). This approach could find use in practical applications, but it is not obvious to me that it will have a large research impact. It would strengthen the work to further explore or explain how and why the gains of this approach will influence future research directions. One possible improvement is to compare against baselines other than Frequent (2003) and CMS (2005); it seems unlikely that there have been no other variations or extensions in this problem domain targeting the distributed setting.

L210: the text mentions "the risk of slightly more approximation" due to using LHH counts instead of the standard CMS counter, but I did not see anything in the main submission (or supplement) quantifying this. In general, for the setting considered here, are we concerned primarily with identifying the correct Top-K, or also with the accuracy of their estimated frequencies?

L305: another naive baseline might be to simply run CMS with a heap of size C*K for some constant factor C (e.g., C = 3).

UPDATE: I have read the author response and found it useful.
Ideally the final submission could incorporate some of this information, in particular the question of whether the problem setting is "identify the Top-K elements" versus estimating their actual counts (which Reviewer #2 also found confusing), and the clarification about the 4K heap size used in the experiments.
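For concreteness, the core idea as I understand it (each CMS cell running the one-counter "Frequent" update instead of a plain counter, with no shared heap) can be sketched roughly as follows. This is my own illustrative reconstruction, not the authors' implementation; the class and parameter names are hypothetical.

```python
import hashlib

class TopkapiLikeSketch:
    """Illustrative sketch (not the paper's code): a Count-Min-style grid
    where each cell also runs the one-counter "Frequent" (majority-vote)
    update, so every cell tracks a local candidate heavy hitter without
    an auxiliary candidate-set heap."""

    def __init__(self, rows=4, width=1024):
        self.rows, self.width = rows, width
        # Each cell holds a (candidate_item, frequent_counter) pair.
        self.cells = [[(None, 0) for _ in range(width)] for _ in range(rows)]

    def _bucket(self, row, item):
        # Hash (row, item) to a bucket index; any pairwise hash would do.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, item):
        for r in range(self.rows):
            b = self._bucket(r, item)
            cand, cnt = self.cells[r][b]
            if cand == item:
                self.cells[r][b] = (cand, cnt + 1)
            elif cnt == 0:
                self.cells[r][b] = (item, 1)       # adopt a new candidate
            else:
                self.cells[r][b] = (cand, cnt - 1)  # Frequent-style decrement

    def candidates(self):
        # Union of per-cell candidates; a final pass would pick the Top-K.
        return {cand for row in self.cells for cand, _ in row if cand is not None}
```

Since each update touches only fixed-size cells, updates are cheap and thread-local copies of the grid can be merged cell-wise, which is where the distributed-parallelism benefit would come from.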

Reviewer 2

This paper proposes a novel algorithm for finding Top-K frequent elements. The algorithm builds on the well-known Count-Min Sketch (CMS) method and combines it with Frequent-based counting for better reducibility and parallelism. The authors also provide good theoretical results for their algorithm.

Personally, I think this paper is well written and solves an important problem. However, the major reason for my giving a weak reject is that the experimental results seem unfinished: Figures 1a-1e only compare the CMS and FA methods, and no Topkapi results are reported. Frankly, these figures are extremely difficult to read because the font size is too small, and in Figure 1f I simply cannot tell which part is the "Merging Thread Local Summaries".

UPDATE: The authors clarified the results reported in the figures; these results demonstrate that the proposed algorithm improves parallelism for the Top-K frequent elements problem.

Minor comments:
- In line 107, "over $D_1 \cup D_2$" should be "$S_1 \cup S_2$".
- There are no definitions for $\epsilon$ and $\delta$, although their meanings can be inferred from the references.
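On the reducibility point: the property that makes the combination attractive, as I read it, is that two same-shaped sketches can be merged cell by cell with the standard Frequent merge rule. A hedged illustration (my own, with hypothetical names; cells are (candidate, count) pairs as in a Frequent-augmented CMS cell):

```python
def merge_cells(cell_a, cell_b):
    """Merge two (candidate, count) cells from same-shaped sketches using
    the Frequent rule: add counts if the candidates agree; otherwise keep
    the larger-count candidate with the difference of the counts."""
    (item_a, cnt_a), (item_b, cnt_b) = cell_a, cell_b
    if item_a == item_b:
        return (item_a, cnt_a + cnt_b)
    if cnt_a >= cnt_b:
        return (item_a, cnt_a - cnt_b)
    return (item_b, cnt_b - cnt_a)
```

Because the merge is associative cell-wise and produces a cell of the same fixed size, thread-local or per-node summaries reduce in a tree without any heap reconciliation, which is what enables the parallel setting the paper targets.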