Paper ID: 786

Title: Sampling Sketches for Concave Sublinear Functions of Frequencies

1. The submission reads well. The introduction is written clearly, with well-turned phrases.

2. The submission demonstrates the quality of the proposed sketch by comparing a resulting variance upper bound with that for PPSWOR. I am slightly concerned that upper bounds are compared here; often, an upper bound does not reflect the true scale of a quantity. It would be better if the authors could derive some form of lower bound for PPSWOR and show that the induced variance bound of the proposed method is not much larger. The submission does mention, in line 93, that the bounds for PPSWOR are "tight". I would like to know how tight these bounds are, i.e., are they provably near-optimal, or merely difficult to improve further?

3. Line 202 mentions that the new sketch scheme is "guided by the sketch ... due to Cohen [9]". I am slightly concerned about the novelty of the new approach. It would be great if the authors could clarify the similarities and differences between the two sketches.

4. The supplement is an extended version of the main paper. I think it would be better to separate the detailed technical parts from the main paper and put them in the appendix; in other words, make the supplement supplementary. The current form is fine as well.

5. Line 100 states that "all (previous methods) require aggregating the data", while line 114 states that PPSWOR also "extends to the unaggregated datasets". It would be better if the authors could explain the latter claim more clearly.

6. It may be better to explain the significance of Section 5 somewhere in the paper. I also feel that Section 5, given its content, should not appear between Sections 4 and 6.

7. The experimental results look good, and the improvement is significant.

8. Minor: line 127, "moments with" instead of "moments wtih"; line 187, better to emphasize the phrase "complement Laplace transform" for consistency; line 219, what are "conditional inclusion probabilities" referring to?

This paper considers the problem of sampling from an unaggregated dataset consisting of key-value pairs (x, v). Let nu_x be the frequency of the key x in the dataset, and let f(.) be any concave sublinear function. A typical problem considered in this paper is estimating \sum_{x \in H} f(nu_x) for a subset of keys H. If one had access to the aggregated data, the optimal approach would be to sample each key x with probability proportional to f(nu_x). When f(.) is the identity function, PPSWOR is a randomized algorithm using exponential random variables that achieves this even on unaggregated datasets. PPSWOR can be considered an optimal sampling procedure in this sense for any f(.) if applied to aggregated data. This paper considers all concave sublinear f and unaggregated datasets.

The full algorithm requires two passes over the data. In the first pass, the authors give a composable sketching algorithm to sample keys. Once a sample of keys is generated, the authors make a second pass over the data to compute the associated estimators. They show that the variance of the resulting estimator (of f(nu_x)) is at most a constant factor worse than that of PPSWOR applied to the aggregated dataset. In addition, the space required by the algorithm is, in expectation, of the same order as PPSWOR.

Any concave sublinear function can be written as an integral from 0 to infinity, and following earlier papers, the authors write this as a sum of two integrals. They then give two separate sampling procedures to handle each part, along with a merging procedure to combine the two samples. The first summand is approximately equal to a scaled frequency of the keys, so PPSWOR can be used to produce a sample for this part. The second summand is handled by a procedure the authors call SumMax. SumMax (Section 5) samples (the primary part of) the keys according to a sum of max values.

I find the result to be interesting and potentially useful. The techniques are interesting.
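The exponential-seed mechanism behind PPSWOR on unaggregated data can be illustrated with a minimal sketch (the function name and the in-memory dictionary are illustrative; a true composable sketch would retain only the k smallest seeds plus a threshold rather than a seed per key):

```python
import heapq
import random


def ppswor_sample(stream, k):
    """Bottom-k PPSWOR sketch over an unaggregated stream of (key, value) pairs.

    Each element (x, v) contributes an independent Exp(v) draw; the minimum
    of these draws over all elements of a key x is distributed Exp(nu_x),
    where nu_x is the key's total frequency. Keeping the k keys with the
    smallest seeds therefore samples keys with probability proportional to
    frequency, without replacement.
    """
    seeds = {}
    for x, v in stream:
        # random.expovariate(v) draws Exp with rate v; track the per-key minimum.
        s = random.expovariate(v)
        if x not in seeds or s < seeds[x]:
            seeds[x] = s
    # Return the k (key, seed) pairs with the smallest seeds, ascending.
    return heapq.nsmallest(k, seeds.items(), key=lambda kv: kv[1])
```

Because the minimum of independent exponentials is itself exponential with the summed rate, the sketch never needs the aggregated frequencies, which is exactly the property the review describes.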
One thing that was not clear to me was which part is the technically most interesting/novel contribution of this paper, given [9] (and possibly other papers). One main complaint I have is that the conference version of the paper does a very poor job of explaining the various parts of the algorithm. Section 3 has some explanation of where the other sections fit into the overall scheme, but Sections 4 and 5 do not fit into the narrative, and there is little to no explanation of why they are needed for the problem. The supplementary portion is a complete paper in itself and is well written. But before looking into the supplementary section, it was very unclear to me why the stochastic version of PPSWOR (Section 4) and SumMax sampling (Section 5) were required.

--- Post Author Feedback ---

After looking at the other reviews and the authors' feedback, I will keep my score unchanged (or possibly raise my overall score by 0.5 points). I appreciate the clarity of the feedback, and I am satisfied with their answers.

I find the problem well-motivated. The paper is fairly well-written and provides enough context before diving into details. The techniques are also interesting, and the end result resolves an important problem. There are some loose ends, such as the large constants in the approximation factor, but overall I see this paper in the accept regime.

Review Update: In the rebuttal, the authors did a better job of explaining how the PPSWOR sketch, SumMax sketch, Sideline data structure, and gamma threshold work together in Algorithm 3.

-------------------------------------------------------------------------

-- Summary --

Massive datasets contain elements represented as key-value pairs. The task is to compute statistics in which each key is weighted by a function of its frequency. The paper specifically focuses on concave sublinear functions to mitigate the effect of keys with high frequencies. The paper designs a composable sampling sketch for any concave sublinear function while requiring space close to the desired sample size. The stochastic PPSWOR sampling and the auxiliary SumMax sampling sketch support generating a sample of keys whose expected weight equals the key's true weight.

Naive Approach:
1. Aggregate all data into a table of keys and their frequencies.
2. Apply the function f to the frequency values.
3. Estimate statistics using a weighted sampling scheme.

Probability Proportional to Size and WithOut Replacement (PPSWOR):
1. Sample keys in proportion to their frequency.
2. Estimate statistics using the function f.
* Requires space equal to the number of distinct keys.
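The naive aggregate-first baseline above can be sketched as follows (a hedged illustration: the helper name is hypothetical, the exponential-seed weighted sampling in step 3 is one standard choice, and sqrt serves as one example of a concave sublinear f):

```python
import math
import random
from collections import Counter


def naive_f_sample(stream, f, k):
    """Naive baseline: aggregate the stream, apply f, then draw a weighted
    sample of k keys without replacement via exponential seeds.

    Space is proportional to the number of distinct keys, since the full
    frequency table must be materialized before sampling can begin.
    """
    freqs = Counter()
    for x, v in stream:                 # 1. aggregate into key -> frequency
        freqs[x] += v
    weights = {x: f(n) for x, n in freqs.items()}   # 2. apply f
    # 3. weighted sampling without replacement: a smaller Exp(weight) seed
    #    means a higher inclusion priority.
    seeds = {x: random.expovariate(w) for x, w in weights.items()}
    return sorted(seeds, key=seeds.get)[:k]


# Example with f = sqrt, a concave sublinear function:
# sample = naive_f_sample([("a", 1.0), ("b", 1.0), ("a", 1.0)], math.sqrt, 2)
```

The contrast with the paper's approach is the aggregation step: the composable sketch avoids ever building the key-to-frequency table.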