Reviews: Differentially Private Distributed Data Summarization under Covariate Shift

I find the problem that the paper addresses quite interesting and with potential practical value. As far as I know this specific problem has not been studied before in the differential privacy literature, although it is in spirit similar to other problems that have been extensively studied, e.g. private query release. The protocol that the paper proposes adapts in a non-trivial way the non-private greedy algorithm and I find interesting the combination of the different parts: the two hash functions and the auction. On the negative side, in some places the paper is quite unclear. For instance I find the trust model not discussed enough, since there are different parties involved I think it would be good if the paper could better explain their level of trust and their relations. Similarly there are several inconsistenties that make the results difficult to follow, e.g. in line 228, where is the j used? Another aspect I find quite lacking is the experimental evaluation. The experiment the paper presents are rather limited in scope. I would have felt more confident if they were repeated for different values of the privacy parameters and over different data. In summary, I find this a rather interesting paper with some negative aspects but leaving an overall positive feeling. Pros: -problem which can have interesting practical applications. -the proposed protocol combines different interesting ideas, including an auction mechanism Cons: -at time the paper shows imprecise writing -the trust model is at time unclear -the experimental evaluation is rather limited Comments after rebuttal ---------------------------- Thanks for your answers. They helped in clarifying several concerns I had. Please, add a more complete discussion of the threat model to the paper, similarly to what you wrote in the rebuttal. In general, I find that more explanations would help in improving the understanding of you our paper.

The paper approaches an interesting algorithmic question in differential privacy: how to release summary datasets of many distributed, private datasets? It presents what appears to be a promising approach for this problem. There is potentially algorithmic contributions of broad interest, however the presentation leaves details unclear, while the threat model - a key leg of the framework - of parsimony, is not clearly distinct from the central model as currently presented. The goal of threat models between local and central is really compelling, and quality contributions of such kind would be really significant. An existing, newish approach is shuffle-based approaches: interest has rapidly grown. (Notably, this direction isn't cited as another approach to going between local/central.) The basic idea here is that we have a trusted central aggregator in the typical central model. However the aggregator limits its access to provider datasets - not retrieving many points in total. This seems like a nature security property. However how much is "little"? There is little discussion of insight offered towards this point. A lower bound - that no other solution for the problem could achieve as few accesses - would provide some confidence of being truly "parsimonious". But then, perhaps another measure of discrepancy could be computed with even less access. Due to clarity (see below) it was difficult to tell if these accesses in total were evenly spread (certainly winners are not; so I think not) - is it possible that the aggregator could mislead the data holders into each revealing a large amount of data? Put another way, there doesn't appear to be a certificate that the aggregator indeed sticks to its access promise. The aggregator is assumed honest-but-curious, and is as trusted as the central model's aggregator. Were this trust misplaced, much of the approach would not be needed. The motivation and introduction extensively describes the desire to build models on distributed, private data. This is an "easier" problem than described - no need to share a dataset, instead do federated learning, share a model. There's a mismatch here. The first hashing technique, leveraging the Rahimi & Recht random Fourier basis, has been used extensively in DP. Independently by Sarwate et al. and Rubinstein et al. (tech reports published approximately 2009; later in JMLR and J Privacy and Confidentiality in 2011/12) in papers on regularised ERM and SVMs under DP. Specifically almost the same application appears there: to avoid sharing data for kernel evaluations (there in the dual solution of the SVM), one uses the data-independent random Fourier basis and shares a DP primal solution. No prior work in DP using Rahimi & Recht is cited. In parts, the paper is very clear. Much of the protocol, proofs, etc. are clear. However in places, clarity is quite poor. One "unimportant" place is line 143 where suddenly objective J(D_s) appears out of nowhere. Given the first term is a constant wrt the decision variable, it doesn't really matter. Capital $V$ line 145 isn't really defined AFAIK; and $N$ in line 147 appears before definition (few lines later). The use of the Rahimi & Recht idea requires translation-invariant kernels; while the paper uses RBF, there's an implication it might work more generally - this condition would be easy to include. How should the privacy sub-budgets be allocated? But there are unclear parts that are more critical to the approach - both checking it, and for the paper to meet its (possibly excellent?) potential: privateauction is not clearly defined. For instance, it apepars that the aggregator runs the exponential mechanism with quality score, the rank of bids (though it isn't described in this way); how would a point be chosen tau times? Not clear how this happens in Step 6 as pointed out. As a related point the PrivAuction appears similar to the exp mechanism, and indeed recall McSherry & Talwar introduce it in the context of mechanism (auction) design. The critical Thm 2 is presented with no insight or discussion whatsoever. Another form of clarity is grammar and more broadly writing style (avoidance of awkward or odd phrasing). The paper is OK in this respect but could do with a few passes by native English speakers: * 2: very less data -> very little data * The abstract is really quite long and detailed for a 8+2pg paper. Comes in at half a page. * I don't think the problem setup really is best described as an "AI Marketplace". This phrase suggests something much broader than was is being built. This could be a "distributed data broker" - return a dataset based on a query dataset. Or something suitably more narrow. * Line 37 "requires a huge" * Line 137 to the end of Sec 2 doesn't belong in the "Problem Setting" section as this covers a sketch of the solution. Suggest reorganising with a new section or similar. More minor points * Why typeset the number of private sources $K$ in bold (e.g. line 6 and onwards)? It's just a scalar. * Citations are regularly formatted inappropriately. No need to have brackets within brackets like line 34 [Gottlieb & Khaled (2017)] COMMENTS AFTER AUTHOR RESPONSE Thanks to the authors for their thoughtful rebuttal. I have read this and the other reviews carefully. There are good clarifying comments therein that do allay many concerns. The argument about the natural lower bound is reasonable, while incentivization is really important in distinguishing the threat model to the central DP model. Clarification on the auction vs exp mechanism is helpful, thankyou. And thanks for the receptive response to comments on Rahimi & Recht, and federated learning. Overall these comments do motivate a higher rating for the paper.

Paper ID:	8195
Title:	Differentially Private Distributed Data Summarization under Covariate Shift

Reviewer 1

Reviewer 2

Reviewer 3