Paper ID: | 7237 |
---|---|

Title: | Near Neighbor: Who is the Fairest of Them All? |

The writing up to page 4 is very wordy and I believe could be more succinctly written. First of all, I am not entirely sure why the problem of sampling from sub collection of sets need to be repeated twice in the paper. In addition, the paper should clearly state that the new sampling strategy can be embedded in the existing LSH method to achieve unbiased query results. Nevertheless, the algorithm does seem interesting - the key bottleneck of estimating the degree of a particular point (basically the number of distinct buckets that contain it) is identified and there are interesting solutions based on existing work in this paper. Section 3 - line 161 states the union of $G$ but $G$ is a set of sets. I can see that the part after the equal sign does make sense. I see that there are similar notational mistakes throughout the paper. Section 4 - line 226: what is $x$ and its relationship to $p$? Section 5 - Please clarify the following experiment setting: "Our data set contains the first 10000 points in the MNIST training data set..." Does choosing the first X points in the dataset help reproducibility? I am not sure why randomization was not used. Also should comment the distribution of the classes in the dataset that was used in the experiment. - Comparison is largely in terms of quality of the query results. Could the authors comment on the querying time and time-accuracy tradeoff? Originality - The sampling framework seems novel. Quality - The paper overall has average quality and somewhat incomplete experimental results. Clarity - The paper could be re-written to deliver the message more succinctly because some definitions are repeated throughout the paper. Significance - The problem/framework the paper discusses is of practical importance but because the paper is more empirically oriented (although the authors may disagree and do provide proof sketches in the appendix), it should provide more compete experimental results.

This paper considers a "fair" variant of the nearest-neighbor problem: given a set of points D, one aims to build a data structure so that, given a query point q, and a radius r, the data structure can be used to return an (almost) uniform at random point of the intersection between D and the ball of radius r centered on q. Such a data structure could be used to avoid "unfair" applications of nearest neighbor approaches (e.g., applications were the same point was to be returned at every query). The authors present a reduction showing that, if one has a LSH that returns a c-approximate NN in time T, using space S, then one can turn that algorithm into one that returns an almost UAR point in the (approximate) ball of the chosen radius with space O(S) and expected time O(T). To prove the reduction the authors introduce a new problem that they solve efficiently: a collection C of sets is given at the outset, and given a subcollection S \subseteq C, one has to return a UAR element from the union of the sets in S. The authors show that one can answer such a query in time O(|C|), by rejection sampling (together with an efficient algorithm for approximating an element degree). The authors then apply the solution to this problem to the fair NN problem, obtaining the aforementioned result (together with some variants). Finally, the authors test their algorithm on a number of datasets. I think this is a very neat paper, and I recommend acceptance.

In this paper the authors provide a new formulation for fair nearest neighbors. The key idea is that instead of reporting the nearest neighbor, or an approximate nearest neighbor, one reports uniformly at random a point within the ball of radius r. The authors provide a data structure of independent interest that essentially samples according to the l0 norm (#distinct elements) in the union of a query. This primitive is used in combination with an LSH black box to design the desired data structures. Then, the proposed method is evaluated on a real-world dataset. From a technical point of view, section 3 and the corresponding formulation are the main technical contribution in my opinion, as well as the fair NN formulation the main originality of the paper. Overall, it is also well written. [I thank the authors for the detailed response. I have upgraded my score from 5 to 6]