Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Adam Smith, Shuang Song, Abhradeep Guha Thakurta
We revisit the problem of counting the number of distinct elements $F_0(D)$ in a data stream $D$, over a domain $[u]$. We propose an $(\epsilon,\delta)$-differentially private algorithm that approximates $F_0(D)$ within a factor of $(1\pm\gamma)$, and with additive error of $O(\sqrt{\ln(1/\delta)}/\epsilon)$, using space $O(\ln(\ln(u)/\gamma)/\gamma^2)$. We improve on prior work at least quadratically, and up to exponentially, in terms of both space and additive error. Our additive error guarantee is optimal up to a factor of $O(\sqrt{\ln(1/\delta)})$, and the space bound is optimal up to a factor of $O\left(\min\left\{\ln\left(\frac{\ln(u)}{\gamma}\right), \frac{1}{\gamma^2}\right\}\right)$. We assume the existence of an ideal uniform random hash function, and ignore the space required to store it. We later relax this requirement by assuming pseudorandom functions and appealing to a computational variant of differential privacy, SIM-CDP. Our algorithm is built on top of the celebrated Flajolet-Martin (FM) sketch. We show that the FM sketch is differentially private as is, as long as there are $\approx \sqrt{\ln(1/\delta)}/(\epsilon\gamma)$ distinct elements in the data set. Along the way, we prove a structural result showing that the maximum of $k$ i.i.d. random variables is statistically close (in the sense of $\epsilon$-differential privacy) to the maximum of $(k+1)$ i.i.d. samples from the same distribution, as long as $k=\Omega\left(\frac{1}{\epsilon}\right)$. Finally, experiments show that our algorithm introduces error within an order of magnitude of that of its non-private analogues for streams with thousands of distinct elements, even while providing a strong privacy guarantee ($\epsilon\leq 1$).
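For intuition, here is a minimal (non-private) Flajolet-Martin sketch in Python. This is an illustrative sketch of the classical estimator the paper builds on, not the paper's private algorithm; the choice of hash function, the bias-correction constant 0.77351, and the single-sketch estimator are standard FM conventions assumed here.

    import hashlib

    def _rho(x: int, width: int = 32) -> int:
        # 1-indexed position of the lowest-order 1 bit of x (width + 1 if x == 0).
        for i in range(width):
            if (x >> i) & 1:
                return i + 1
        return width + 1

    def fm_estimate(stream, width: int = 32) -> float:
        # Track the maximum rho over hashed items. Duplicates hash identically,
        # so they cannot raise the maximum: the sketch depends only on the set
        # of distinct elements.
        r = 0
        for item in stream:
            h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:4], "big")
            r = max(r, _rho(h, width))
        # 0.77351 is the classical FM bias-correction constant; a single sketch
        # is noisy (roughly a factor-of-2 spread), so FM averages many copies.
        return (2 ** r) / 0.77351

    # Example: a stream with 5,000 distinct elements.
    stream = [i % 5000 for i in range(100_000)]
    print(f"FM estimate: {fm_estimate(stream):.0f} (true count: 5000)")

The paper's observation is that once the stream contains enough distinct elements ($\approx \sqrt{\ln(1/\delta)}/(\epsilon\gamma)$), the distribution of this maximum changes very little when one element is added or removed, so the sketch is private without added noise.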
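The structural result about maxima can also be sanity-checked numerically. The sketch below (an illustration, not the paper's proof) uses Geometric(1/2) samples, which is exactly the distribution of the bit-position statistic $\rho$ above, and reports the total variation distance between the maximum of $k$ and of $(k+1)$ i.i.d. samples; it decays on the order of $1/k$, consistent with the two maxima becoming indistinguishable at privacy level $\epsilon$ once $k=\Omega(1/\epsilon)$.

    import numpy as np

    def max_pmf(k: int, p: float = 0.5, support: int = 200) -> np.ndarray:
        # PMF of the maximum of k i.i.d. Geometric(p) variables on {1..support},
        # via P(max <= t) = F(t)**k with F(t) = 1 - (1 - p)**t.
        t = np.arange(1, support + 1)
        cdf_k = (1.0 - (1.0 - p) ** t) ** k
        return np.diff(np.concatenate(([0.0], cdf_k)))

    for k in (10, 100, 1000):
        tv = 0.5 * np.abs(max_pmf(k) - max_pmf(k + 1)).sum()
        # 1/(e*k) is a rough analytic reference: max_x x**k * (1 - x) ~ 1/(e*(k+1)).
        print(f"k={k:5d}: TV(max of k, max of k+1) = {tv:.5f}  (~1/(e*k) = {1 / (np.e * k):.5f})")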