Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)
Vincent Cohen-Addad, Niklas Oskar D. Hjuler, Nikos Parotsidis, David Saulpic, Chris Schwiegelshohn
We consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, several parameters are of importance: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, $k$-median and $k$-means has been raised in a recent paper by Hubert-Chan et al. [WWW'18] and also appears as a natural follow-up on the online model with recourse studied by Lattanzi and Vassilvitskii [ICML'17] (i.e.: in insertion-only streams). In this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant factor approximation, with $O(n\log n)$ update time, and total recourse $O(n)$. This improves over the naive algorithm which consists in recomputing a solution at each time step and that can take up to $O(n^2)$ update time, and $O(n^2)$ total recourse. These bounds are nearly optimal: in general metric space, inserting a point take $O(n)$ times to describe the distances to other points, and we give a simple lower bound of $O(n)$ for the recourse. Moreover, we generalize this result for the $k$-medians and $k$-means problems: our algorithm maintains a constant factor approximation in time $\widetilde{O}(n+k^2)$. We complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time $t$ is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time $t$ while having a much better running time.