Paper ID: | 1924 |
---|---|

Title: | Practical Differentially Private Top-k Selection with Pay-what-you-get Composition |

Content: The authors’ explore the problem of DP top-k selection. They relax the requirements of the problem to allow the release of less than k of the top-k elements. This allows the authors’ to use a “noisy threshold” technique that allows them to produce a DP algorithm that does not need to know the entire domain. They further show that if the database contains k elements whose counts are significantly higher than the remaining elements then all k will be released with high probability. The algorithm also has an option of releasing a special symbol if at some point the remaining elements are all approximately equal. Originality: This work contains some original ideas. The major step of a noisy threshold seems somewhat familiar but it’s application in this setting is, to my knowledge, new. The step of reducing the effective domain size that is worked over is interesting. In practical applications, gaining constants in the privacy parameters is important. I have not seen an advanced composition theorem of this form before and since the authors’ prove it is applicable for the exponential mechanism with monotonic quality function, it seems broadly applicable. Quality: The paper seems technically sound. It contains full proofs and the results are clearly presented. I did not read the proofs in detail but the intuition is clear and a cursory look at the proofs showed no holes. Clarity: The paper is well-written. It is organized and easy to follow. The supplementary contains a larger version (presumably the arXiv version) that contains proofs. In the shortened version the authors’ do a good of providing the intuition of the proof. In my opinion, after reading the short version, an interested reader would have a good idea of how and why the results are true (although they may not be able to prove them without the supplementary). Significance: The authors’ consider an important problem in a more realistic context than has previously been considered. Unlike previous work, the algorithm presented in this paper does not require knowledge of the entire domain. This is important since in real-world systems, the algorithm rarely has access to even a description of the entire domain. It is difficult to achieve for a private algorithm because the unknown domain elements can appear in a neighboring database. The authors’ side-step this issue by only releasing elements that are clearly in the top-k, i.e. elements that definitely also appear in any neighboring database. The algorithm is also built to run on top of current systems that are already in place, rather than requiring specialized infrastructure. This is an important feature for speedy, large-scale deployment by non-experts. The paper also contains several results of independent interest. Perhaps the most interesting is the composition theorem for “eps-range bounded” mechanism. This allows them to shave constants off the advanced composition theorem for mechanisms that satisfy a slightly stronger condition than DP. They show that the exponential mechanism with a monotonic quality function satisfies this stronger condition. This allows them to prove a “pay what you get privacy” result (this result allows them to make iterative calls to their private top-k mechanism and essentially only pay for the number of elements output, rather than the number of calls made to the algorithm). They also present a proof of the fact that rather than using the exponential algorithm to iteratively “peel-off” the top-k results, one can simply add Gumbel noise to the quality scores and output the top-k noisy scores. This result has appeared in the literature before (e.g. “Private False Discovery Rate Control” [Dwork, Su, Zhang, arXiv ’15]) but is still a reasonably unknown result. Although it is not an original result, it’s nice to see it presently succinctly and clearly. Minor comments: - Typo, line 29. “In this work,…”

+ The problem is well-studied in the DP literature, but the paper still manages to provide a contribution that is novel and has practical relevance. The fact that the algorithm does not need to know the domain is great. + The algorithmic ideas are fairly natural and has Propose-test-release style flavor. Q: Can we include sparse-vector to characterize the privacy cost at a more fine-grained level? - It would be interesting to see a practical/empirical evaluation of the algorithms mentioend, while highlighting the savings in privacy cost, and compare it to existing results in the literature.

The paper contains many results, which is, unfortunately, counts as a weakness due to its poor organization. For example, there's an intermission on "bounded range composition", which introduces a strengthening of pure DP. Epsilon-range-boundedness ends up being a factor-2 approximation of pure-DP, and it is then shown to admit a slightly better (not even by a factor of 2) advanced composition theorem for a special kind of exponential mechanism. By itself, this is a useful observation (which appears already as a remark in the original McSherry & Talwar work) but does it really strengthen the paper? Overall the paper has several interesting ideas. First, it is the relaxation of the top-k objective. Second, it is the use of the Gumbel distribution whose connection to the exponential mechanism has not be fully explored. Third, it is the pay-what-you-get accounting mechanism. Everything else adds additional and unnecessary layers of complexity. It would interesting to frame the paper's approach using the language of privacy odometers and filters (https://arxiv.org/abs/1605.08294). The "Odometers" paper seemingly warns against the type of composition that the submission pursues; it'd be worthwhile to identify reasons why the submission avoids the problems presented by odometers. Another possible connection with previous work is "Scalable Private Learning with PATE" (https://arxiv.org/abs/1802.08908). The key primitive of that paper is top-1 query, and its main insight is that there are significant savings in the privacy budget can be extracted if the querying mechanism does not answer (returns a bottom) when there's no clear favorite.