{"title": "Fully Dynamic Consistent Facility Location", "book": "Advances in Neural Information Processing Systems", "page_first": 3255, "page_last": 3265, "abstract": "We consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, several parameters are of importance: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, $k$-median and $k$-means has been raised in a recent paper by Hubert-Chan et al. [WWW'18] and also appears as a natural follow-up on the online model with recourse studied by Lattanzi and Vassilvitskii [ICML'17] (i.e., in insertion-only streams).\n\nIn this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant factor approximation, with $O(n\\log n)$ update time, and total recourse $O(n)$. This improves over the naive algorithm, which consists of recomputing a solution at each time step and can take up to $O(n^2)$ update time and $O(n^2)$ total recourse. These bounds are nearly optimal: in a general metric space, inserting a point takes $O(n)$ time to describe its distances to the other points, and we give a simple lower bound of $\\Omega(n)$ for the recourse.
Moreover, we generalize this result to the $k$-median and $k$-means problems: our algorithm maintains a constant factor approximation in time $\\widetilde{O}(n+k^2)$.\n\nWe complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time $t$ is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time $t$, while having a much better running time.", "full_text": "Fully Dynamic Consistent Facility Location\n\nVincent Cohen-Addad, CNRS & Sorbonne Université (vcohen@di.ens.fr)\n\nNiklas Hjuler, University of Copenhagen (hjuler@di.ku.dk)\n\nNikos Parotsidis, University of Copenhagen (nipa@di.ku.dk)\n\nDavid Saulpic, École normale supérieure & Sorbonne Université (david.saulpic@lip6.fr)\n\nChris Schwiegelshohn, Sapienza University of Rome (schwiegelshohn@diag.uniroma1.it)\n\nAbstract\n\nWe consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, there are several important parameters: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, k-median and k-means has been raised in a recent paper by Hubert-Chan et al. [WWW'18] and also appears as a natural follow-up on the online model with recourse studied by Lattanzi and Vassilvitskii [ICML'17] (i.e., in insertion-only streams).\n\nIn this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant factor approximation, with O(n log n) update time, and total recourse O(n).
This improves over the naive algorithm, which consists of recomputing a solution after each update and can take up to O(n^2) update time and O(n^2) total recourse. Our bounds are nearly optimal: in a general metric space, inserting a point takes O(n) time to describe its distances to the other points, and we give a simple lower bound of Ω(n) for the recourse. Moreover, we generalize this result to the k-median and k-means problems: our algorithms maintain a constant factor approximation in Õ(n + k^2) time per update.\n\nWe complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time t is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time t, while having a much better running time.\n\n1 Introduction\n\nClustering is a core procedure in unsupervised machine learning and data analysis. Due to the large number of applications, clustering problems have been extensively studied for several decades. The existing literature includes both very precise algorithms [1, 18, 31] and very fast ones [34]. Due to the importance of the task, clustering problems have also been studied in several computing settings, such as the streaming model [11], the sliding-window model [7], the distributed model [4], the dynamic model [24], and others.\n\nApplications nowadays operate on dynamically evolving data: e.g., pictures are constantly added to and deleted from picture repositories, purchases are continuously added into online shopping systems, reviews are added or edited in retail systems, etc. Due to the scale and the dynamic nature\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nof the data at hand, conventional algorithms designed to operate on static inputs become unable to handle the task, for two main reasons.
First, the running time of even the most efficient algorithms is too expensive to execute after every single change in the input data. Second, re-running a static algorithm after every update might generate solutions that differ substantially between consecutive updates, which might be undesirable for the application at hand. The number of changes in the maintained solution between consecutive updates is called the recourse of the algorithm. Our study is motivated by these limitations of static algorithms, and of dynamic algorithms that are effective on only one of the two objectives.\n\nMost fundamental problems in computer science have been studied in the dynamic setting. At a very high level, a dynamic algorithm computes a solution on the initial input data and, as the input undergoes insertions and/or deletions of elements, updates the solution to reflect the current state of the data. A dynamic algorithm may allow only insertions or only deletions, or may support an intermixed sequence of insertions and deletions, in which case the algorithm is called fully dynamic. The running time of a dynamic algorithm can either guarantee a worst-case update time after each update, or a bound on the average update time over a sequence of updates, which is called an amortized update bound. A dynamic algorithm with worst-case update bounds is the most desirable, and often hard to obtain, but in several applications algorithms with amortized update bounds are sufficient.\n\nMost of the clustering-related literature has focused on the online model, where the updates are restricted to insertions only and a decision cannot be revoked, or on the streaming model, where there is a specific memory budget not to be exceeded.
However, as observed by Lattanzi and Vassilvitskii [30], the online model may appear too restrictive: if a bad decision has been made, it is often fine to spend some time to correct it instead of suffering the bad decision (i.e., keeping a bad clustering) for the rest of the stream. However, spending too much time on the modification of the clustering may be counterproductive, and this is what we aim to capture in the fully-dynamic model with limited recourse: keeping a good clustering while spending the least amount of time and making as few changes to the current clustering as possible.\n\nIn this paper, we study fully-dynamic algorithms for classic clustering problems. In particular, we consider the facility location, the k-means, and the k-median problems in the dynamic setting. In the static case, these problems are defined as follows. Let X be a set of n points, and d : X × X → R a distance function. We assume that d is symmetric and that (X, d) forms a metric space. For the (k, p)-clustering problem, the objective function that we seek to optimize is Cp(X, S), where S ⊆ X, |S| = k. Setting p = 1 gives the k-median objective, and p = 2 the k-means one. For the facility location problem the objective function is C(X, S).\n\nC(X, S) := Σ_{x∈X} min_{c∈S} d(x, c) + f · |S|,    Cp(X, S) := Σ_{x∈X} min_{c∈S} d^p(x, c).\n\nAll of these problems are NP-hard, so our best hope is to design algorithms with provable approximation guarantees. There is an extensive line of work on algorithms with constant approximation guarantees for all three aforementioned problems [1, 3, 10, 20, 26, 27, 31, 33].
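As a concrete reference for these two objectives, the following sketch evaluates both cost functions on a small point set. This is our own minimal illustration, not code from the paper; the point set, facility cost f, and helper names are assumptions.

```python
import math

def facility_location_cost(points, centers, f, d):
    """C(X, S): connection cost to the nearest center plus f per open facility."""
    return sum(min(d(x, c) for c in centers) for x in points) + f * len(centers)

def kp_clustering_cost(points, centers, p, d):
    """Cp(X, S): p = 1 gives the k-median objective, p = 2 the k-means one."""
    return sum(min(d(x, c) ** p for c in centers) for x in points)

dist = math.dist  # Euclidean distance, as an example metric d
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
S = [(0.0, 0.0), (5.0, 5.0)]
fl = facility_location_cost(X, S, f=0.5, d=dist)  # connection cost 2.0 plus 2 facilities at 0.5 each
km = kp_clustering_cost(X, S, p=2, d=dist)        # squared distances: 0 + 1 + 1 + 0
```

Note that the facility location objective charges f for every open center, so opening a facility at every point trivially bounds the cost by n·f, a fact used later in the analysis.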
On the other hand, Mettu and Plaxton [34] showed an Ω(nk) lower bound for computing an O(1)-approximate solution for k-median and k-means in general metric spaces.\n\nIn the dynamic setting, the goal is to efficiently maintain a good solution to the clustering problem at hand as the set of points undergoes element insertions and deletions. The main criterion for designing a good dynamic algorithm for these problems is the quality of the clustering, with respect to the optimum solution, at any given time. However, in many applications, it is equally important to maintain a consistent clustering, namely a clustering with bounded recourse. Lattanzi and Vassilvitskii [30] have recently considered consistent clustering problems in the online setting, where the points appear in sequence and the objective is to maintain a constant factor approximate solution while minimizing the total number of times the maintained solution changes over the whole sequence of points. Another criterion, explored much less but highly important when dealing with massive data, is the time it takes to update the solution after each update so that the solution remains within a constant factor of the optimum solution.\n\n1.1 Our Contribution\n\nWe present the first work that studies fully-dynamic algorithms while considering the approximation guarantee, consistency, and update time, all at the same time. From an input perspective, we consider general metric spaces. Thus, an element of the input is a point in a metric space, which is defined by its distances to the other points of the metric. The contribution of our paper is summarized as follows:\n\n• We give a fully-dynamic algorithm for the facility location problem that maintains a constant factor approximate solution with constant recourse per update and O(n log n) update time.
We moreover show that constant recourse per update is necessary for achieving a constant factor approximation.\n\n• We extend the algorithm for facility location to the k-median and k-means problems. Here, our algorithm maintains a constant factor approximate solution with Õ(n + k^2)¹ update time (Theorem 3.1). This is the first non-trivial result for these problems, as the only known solution was to recompute from scratch after each update: this requires time Ω(nk) for k-median and Ω(n^2) for facility location, per update. Hence, our time bounds are significantly better than the naive approach for a large range of k.\n\nEmpirical Analysis. We complement our study with an experimental analysis of our algorithm on three real-world data sets and show that it outperforms the standard approach that recomputes a solution from scratch after each update using a fast static algorithm. Interestingly, we show that this barely impacts the approximation guarantee. At the same time, our algorithm outperforms the simple-minded solutions by at least three orders of magnitude, both in terms of running time and total number of changes made in the maintained solution throughout the update sequence.\n\n1.2 Related Work\n\nOnline and Consistent Clustering. Online algorithms for facility location were first proposed by Meyerson [35] in his seminal paper. Fotakis [16] later showed that the algorithm has a competitive ratio of O(log n/ log log n), which is also optimal. Additionally, the algorithm has a constant competitive ratio if the points arrive in a random order [35, 29]. There also exist O(log n)-competitive deterministic algorithms, see [2, 15]. This was recently extended to the online model that incorporates deletions [12].\n\nOnline algorithms for clustering that are only ever allowed to place new centers cannot be competitive.
This led to the consideration of the incremental model, which allows two clusters to be merged at any given time. Work in this area includes [9, 14]. The number of reassignments (commonly referred to as recourse) over the execution of an incremental algorithm may be arbitrary. However, Lattanzi and Vassilvitskii [30] recently considered the online clustering problem with bounded total recourse. They showed a lower bound of Ω(k log n) changes over an arbitrary sequence of updates, and presented an algorithm that maintains a constant factor approximation while limiting the total recourse to O(k^2 · log^4 n). Their work differs from ours in that elements can only be added, and in that they do not consider optimizing the running time. In the fully dynamic case their bound on the recourse does not hold, and we moreover show that constant recourse per update is unavoidable.\n\nFully-Dynamic and Streaming Algorithms. Streaming algorithms for clustering can be used to obtain fast dynamic algorithms by recomputing a clustering after each update. Since streaming algorithms are highly memory-compressed and typically process updates in time linear in the memory requirement, this approach automatically yields good update times. Low-memory adaptations of Meyerson's algorithm [35] turned out to be simple and particularly popular, see [8, 29, 37]. Another technique for designing clustering algorithms in the streaming model is maintaining coresets; see the recent survey [36] for an overview. For fully dynamic data streams, the only known algorithms for maintaining coresets for k-means and k-median in Euclidean spaces using small space and update times are due to Braverman et al. [6] and Frahling and Sohler [17].
There also exists some work on estimating the cost of Euclidean facility location in dynamic data streams, see [13, 25, 28]. For more general metrics, the problem of maintaining a clustering dynamically has been considered by Henzinger et al. [22] and Goranci et al. [19], who consider facility location in bounded doubling dimension. The arguably most similar previous work to ours is due to Hubert-Chan et al. [24]. They consider the k-center problem in general metrics in the fully dynamic model. Here, they were able to maintain a constant factor approximation with update time O(k log n).² Whether an algorithm in the fully dynamic model with low recourse and update times exists was left as an open problem.\n\n¹Õ(·) hides polylog factors.\n\n1.3 Preliminaries\n\nWe assume that we are given some finite metric space (X, d), where X is the set of points and d : X × X → R≥0 a distance function. Every entry d(a, b) is stored in a (symmetric) n × n matrix D. Our algorithms work in the distance oracle model, which assumes that we can access any entry of D in constant time.\n\nOur input consists of tuples (X, R^n_{≥0}, {−1, 1}). The first coordinate is the identifier of some point p ∈ X, the second coordinate is the column/row vector in D associated with p, and the last coordinate signifies insertion (1) or deletion (−1). We assume that the stream is consistent, which means that no point can be deleted without having been previously inserted. The adversary generating the point sequence is called adaptive if it can modify the sequence depending on the algorithm's choices.\n\nThroughout the paper, we let X^t be the set of points present at time t, n be the total number of updates, and n∗ := sup_{t∈{1,...,n}} |X^t| be the maximum number of points present at the same time. We denote by OPT^t the optimum solution at time t.
All our results could be phrased in terms of |X^t|, but for simplicity we present them in terms of n∗.\n\nRoadmap. Our paper is organized as follows. In Section 2, we describe our algorithm for fully dynamic facility location. Section 3 extends these results to k-median and k-means clustering. We conclude with an experimental evaluation of our algorithms on real-world benchmarks in Section 4. All omitted proofs can be found in the supplementary material.\n\n2 Dynamic Facility Location\n\nThe goal of this section is to prove the following theorem.\n\nTheorem 2.1. There exists a randomized algorithm that, given a metric space undergoing insertions and deletions of points, maintains a set of centers S^t such that:\n\n• each update is processed in time O(n∗ log(n∗)) with probability 1 − 1/n∗;\n\n• at any given time t, C(X^t, S^t) = O(1) · C(X^t, OPT^t) with probability 1 − 1/n∗;\n\n• Σ_{t=0}^{n} |S^t △ S^{t+1}| = O(n), i.e., the amortized recourse is O(1) per update.\n\nThe proof is divided into several lemmas: we first study how the optimal cost behaves upon dynamic updates, and we then exhibit an algorithm that maintains a solution whose cost evolves in a similar way as the optimum.\n\nAlthough perhaps counter-intuitive, removing a point from the input in a finite metric may increase the cost of a clustering, since one cannot locate a center there anymore. We show in the supplementary material that this increase is bounded by a factor 2; this leads to the following lemma, which bounds the evolution of the optimal cost.\n\nLemma 2.2. Let OPT_before be the optimal cost of an initial metric space X. After an arbitrary sequence of n_i insertions and n_d deletions of points in X, the optimal cost OPT_after satisfies OPT_before/2 − n_d · f ≤ OPT_after ≤ 2(OPT_before + n_i · f).\n\nMaintaining a solution during a few updates.
We now turn to designing an algorithm competitive with the optimal solution, showing first how to deal with a small number of updates. In order to process deletions, we define the notion of substitute centers: given a function s mapping every center from the initial solution to a center in the current one, we say that s(c) substitutes c. Initially, s(c) = c. When a center c is deleted, the algorithm opens a replacement center c_r and updates the function s: s(s^{−1}(c)) = c_r.\n\n²Under the common assumption that the ratio of the longest distance to the shortest distance of the metric is polynomially bounded.\n\nThe algorithm is as follows: when a point x is inserted, we open a facility at x, and for convenience we define s(x) = x. When a point x is deleted, we have two cases: either x was not an open facility, in which case the algorithm does nothing, or x was a facility. In the latter case, let c = s^{−1}(x): the algorithm opens the closest point c′ to c in X^0 that is still part of the metric, and sets s(c) = c′. This choice of c′ ensures that, for all points x in the current metric space, d(c′, c) ≤ d(x, c).\n\nLemma 2.3. Starting from any metric space (X^0, d) and an α-approximation with cost Θ, the algorithm described above maintains an (8α + 4)-approximation during Θ/(4αf) updates, with O(1) recourse and O(n∗) time amortized per update.\n\nProof. This algorithm opens at most one new facility at every update: the recourse is thus at most 1. The time to process an insertion is constant, and the time to process a deletion is at most O(n∗) (the time required to compute the closest point to x).\n\nWe now analyse the cost of the solution produced after t updates. Since the recourse is at most 1 per update, the cost of open facilities increases by at most t · f.
Since every inserted point is opened as a center, it does not contribute to the connection cost: this cost therefore changes only through deletions of points from X^0. Similarly to Lemma 2.2, one can show that the connection cost of a point x ∈ X^0 at most doubles. More formally, let c ∈ X^0 be the center that serves x in the initial solution. Let c′ = s(c) be the center that substitutes c in the current metric. By the choice of c′ and the triangle inequality, it holds that d(x, c′) ≤ d(x, c) + d(c, c′) ≤ 2d(x, c). Hence, the total serving cost is at most twice as expensive as in the initial solution. The cost at time t is therefore at most 2Θ + t · f.\n\nLet t ≤ Θ/(4αf), and let OPT_0 be the optimal cost in the initial state. By Lemma 2.2, the optimal cost at time t is at least OPT_0/2 − t · f ≥ OPT_0/4, since by definition of α it holds that t ≤ OPT_0/(4f). Moreover, the cost of our algorithm at time t is at most Θ(2 + 1/α) ≤ OPT_0(2α + 1). Combining the two inequalities gives that our algorithm is an (8α + 4)-approximation for all t ≤ Θ/(4αf), which concludes the proof.\n\nWe remark that the parameters can be optimized: for instance, with a suitable data structure, the time to find a substitute center can be logarithmic; however, this is dominated by the complexity of finding the initial α-approximation.\n\nMaintaining a solution for any number of updates. We combine Lemma 2.3 with a classic static O(1)-approximation algorithm, namely Meyerson's algorithm, to prove Theorem 2.1.\n\nProof of Theorem 2.1. We summarize here the useful properties of Meyerson's algorithm, and refer to the supplementary material for more details. The algorithm processes the input points in a random order, opening each point x with probability min(d(x, F)/f, 1) (where F is the set of previously opened facilities).
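A minimal sketch of this opening rule, in our own notation (the facility cost f, the distance function, and the fixed random seed are assumptions; the probability is capped at 1, as is standard for Meyerson's rule):

```python
import math
import random

def meyerson(points, f, d, rng):
    """Online facility location sketch: open each arriving point as a facility
    with probability min(d(x, F)/f, 1); otherwise connect it to its nearest
    open facility and pay the connection cost."""
    facilities = []
    cost = 0.0
    for x in points:
        dist = min((d(x, c) for c in facilities), default=math.inf)
        if rng.random() < min(dist / f, 1.0):
            facilities.append(x)   # open a new facility at x (pays f)
            cost += f
        else:
            cost += dist           # connect x to its closest open facility
    return facilities, cost

# The analysis requires the points to arrive in a random order.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
rng = random.Random(1)
order = pts[:]
rng.shuffle(order)
F, c = meyerson(order, f=1.0, d=math.dist, rng=rng)
```

The very first point is always opened (its distance to the empty facility set is infinite), which matches the intuition that the algorithm pays at least one facility cost f.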
If the algorithm opens k facilities, its running time is O(kn∗), and its cost Θ is at least kf. Hence, the running time is O(Θ/f · n∗). Moreover, one can assume that the cost is always at most n∗f (by opening a facility at every point).\n\nWe say that a run of Meyerson's algorithm is good if it yields an O(1)-approximate solution. By the analysis in [35], a run is good in expectation (where the randomness comes from the random ordering of points): hence, by running log(2n∗) independent copies of the algorithm, at least one run is good with probability 1 − (1/n∗)². We let α be the approximation constant of this algorithm.\n\nTherefore, our main algorithm works as follows: start with a solution given by Meyerson's algorithm of cost Θ, use Lemma 2.3 to maintain a solution during Θ/(4αf) updates, and then recompute from scratch. We call the intervals between consecutive recomputations periods, and note that they are random objects: the length of a period is determined by the cost of its initial solution, which is a random variable.\n\nWe first analyze the running time of this algorithm. Within one period, Lemma 2.3 ensures that the running time is O(n∗) per update. Moreover, the running time of the initial recomputation is O(Θ/f · n∗ log n∗), and the length of the period is Ω(Θ/f). Therefore the amortized running time is Õ(n∗) per update. Since the initial recourse is O(Θ/f), the same argument proves that the recourse is amortized O(1) per update.\n\nWe aim to use Lemma 2.3 again to prove that, at a given time t, the solution is a constant factor approximation. For this, let P be the period in which t appears. If the period is good, then Lemma 2.3 concludes. Unfortunately, the fact that t is in P is not independent of P being good (for instance, if P is very long, it is unlikely to be good).
However, note that the starting time of P cannot be before t − n∗: indeed, a period lasts for at most Θ/(4αf) ≤ n∗f/(4αf) ≤ n∗ updates. Hence, if we condition on all periods starting between t − n∗ and t being good, Lemma 2.3 applies and the solution at time t is a constant factor approximation. Since any period is good with probability 1 − (1/n∗)², all periods between t − n∗ and t are good with probability 1 − 1/n∗ by a union bound. This concludes the proof.\n\nThe algorithm sketched in the previous proof can be transformed so that the complexity becomes Õ(n∗) in the worst case, by spreading the recomputation over several updates (see the supplementary material). Moreover, randomization is not needed in order to maintain the solution (only to compute a starting approximation): hence the algorithm works against an adaptive adversary.\n\nWe conclude this section by showing that our algorithm is (up to logarithmic factors) optimal both in update time and recourse.\n\nProposition 2.4. Any algorithm maintaining an O(1)-approximation for Facility Location must have an amortized update time Ω(n∗) and total recourse Ω(n), where n is the total number of updates.\n\n3 Dynamic k-Median and k-Means in Linear Time\n\nIn this section we adapt the algorithm from Section 2 to handle the stricter problems of k-median and k-means. For simplicity, we call (k, p)-clustering the problem of finding k centers that minimize Cp (p = 1 for k-median and p = 2 for k-means).\n\nRoughly speaking, our algorithm works as follows.
We use an adaptation of the algorithm from Section 2 to maintain a coreset R of Õ(k) points that contains a constant factor approximate solution for the (k, p)-clustering problem. Then, we apply a constant factor approximation algorithm for the metric (k, p)-clustering problem on the maintained coreset (e.g., we can use a quadratic-time local-search algorithm, see [21]). This yields the following theorem.\n\nTheorem 3.1. There exists a randomized algorithm that, given a metric space undergoing insertions and deletions of points, maintains a set of centers S^t with Õ(n∗ + k^2) update time such that, for any time t, Cp(X^t, S^t) = O(1) · Cp(X^t, OPT^t).³\n\nThe remainder of this section is devoted to proving Theorem 3.1. The main hurdle in applying the framework from Section 2 is that the optimum solution can change drastically with the addition or deletion of a point, and it is therefore not easy to adapt the previous amortization argument. To overcome this barrier, we make use of the following lemma, from [9] and [30]:\n\nLemma 3.2. Let L be some integer. With probability 1/2, running Meyerson's algorithm for Facility Location with f = L/(k(1 + log n∗)) gives a set S of 4k · (1 + log n∗) · (2^{2p+1} · Cp(X, OPT)/L + 1) centers such that Cp(X, S) ≤ L + 4 · Cp(X, OPT).\n\nFor completeness, we provide the pseudocode of Meyerson's algorithm, adapted for our purpose, in Procedure MeyersonCapped. The lemma implies that, if we know a value L that approximates OPT within a factor 2, Procedure MeyersonCapped computes a set of points R and an assignment of points φ such that Σ_{x∈X} d(x, φ(x)) ≤ 6 · Cp(X, OPT) with probability 1/2. This probability can be boosted to 1 − (1/n∗)² by taking the union of q = O(log n∗) independent copies of the algorithm. Therefore, for all i = 1, ..., q, our algorithm will use this lemma assuming Cp(X, OPT) ∈ [2^i, 2^{i+1}), and taking L = 2^i. This provides, for all i, a set R_i of O(k log² n∗) centers.\n\nIt remains to maintain those sets dynamically. Similarly to Section 2, we use the solution computed by Procedure MeyersonCapped for the subsequent k updates, so that we can amortize the update-time bound. However, for (k, p)-clustering it is not possible to bound the cost of OPT after a few updates. We overcome this obstacle by updating the sets R_i more carefully. More precisely, let R^t_i be the (updated) set R_i after t updates of the algorithm. The algorithm ensures the following invariant:\n\nInvariant 3.3. The set R^t_i has size O(k log² n∗) and, with high probability, there exists i such that Cp(X^t, R^t_i) = O(1) · Cp(X^t, OPT^t).\n\n³We assume (as in [30]) that the minimum distance in the metric is 1 and the maximum ∆ is bounded by a polynomial in n∗. Alternatively, our bounds can be stated with log ∆ instead of log n∗.\n\nProcedure (a) MeyersonCapped(L, X).\nInput: An integer L, a set of points X.\nOutput: A set of centers R, an assignment φ of points to centers, and t_l the id of the last center opened.\n1: Let R ← ∅ and x_1, ..., x_{|X|} be a random order on the points of X\n2: for all i ∈ {1, ..., |X|} do\n3:   if |R| < 4k · (1 + log n∗) · (2^{2p+2} + 1) then\n4:     add x_i to R with probability d(x_i, R)^p · k(1 + log n∗)/L\n5:     if |R| = 4k · (1 + log n∗) · (2^{2p+2} + 1) then\n6:       t_l ← i\n7:     end if\n8:   end if\n9:   φ(x_i) ← argmin_{c∈R} {d(x_i, c)^p}\n10: end for\n\nProcedure (b) DeletePoint(L, t_l, X, R). L is the value with which we approximate Cp(X, OPT) and t_l is the last time MeyersonCapped opened a center.\nInput: Integers L and t_l, a set of points X and a set of centers R.\nOutput: Updated R and t_l, an assignment φ of points to centers.\n1: {t_l is the last time MeyersonCapped was invoked.}\n2: for all j ∈ {t_l, ..., |X|} do\n3:   if no center was opened yet then\n4:     add x_j to R with probability d(x_j, R)^p · k(1 + log n∗)/L\n5:     if x_j is added to R then\n6:       t_l ← j, φ(x_j) ← x_j\n7:     end if\n8:   else {Update φ}\n9:     φ(x_j) ← argmin_{z ∈ {φ(x_j), x_{t_l}}} {d(x_j, z)^p}\n10:  end if\n11: end for\n\nFor this, initialize R_i to be the union of the outputs of q = O(log n∗) independent executions of MeyersonCapped(L_i, X), for L_i = 2^i and i = 1, ..., q. The algorithm updates these sets during k updates before recomputing them from scratch. In the case of a point insertion, it suffices to add the new point to all R_i: over k updates, this changes the cardinality by at most k while the cost remains the same, and therefore the two conditions are met. The case of a point deletion requires more work. The idea is, as in Section 2, to replace the deleted center by its closest point. However, this is not enough to ensure Invariant 3.3: this is taken care of by Procedure DeletePoint, which finds the next point in X^t that MeyersonCapped would open if there were no constraint on the size of R.\n\nWe are now ready to describe our fully-dynamic algorithm for maintaining a constant-approximate solution to the (k, p)-clustering problem. The algorithm uses Procedures MeyersonCapped and DeletePoint as subroutines to build and maintain the sets R_i, and after each update calls the static constant-approximate algorithm to compute an approximate solution S^t_i on each weighted instance R_i (where the weight of each point x ∈ R_i corresponds to the number of points of X^t assigned to x by the function φ_i, computed in Procedures MeyersonCapped and DeletePoint). After each update, the algorithm keeps the solution S^t_i for i = argmin_i {Cp(R^t_i, S^t_i)}, that is, S^t = S^t_i. The pseudocode of the algorithm is stated in Algorithm 1. The proof of Invariant 3.3 is stated in the supplementary material.
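The final selection step (running the static algorithm A on every candidate weighted coreset and keeping the cheapest solution) can be sketched as follows. This is our own illustration: `static_kp_solver` is a stand-in for the constant-approximate static algorithm A, and all names are assumptions.

```python
import math

def weighted_cost(coreset, weights, centers, p, d):
    """Cp on a weighted coreset: each coreset point counts with its weight
    (the number of original points assigned to it)."""
    return sum(w * min(d(x, c) ** p for c in centers)
               for x, w in zip(coreset, weights))

def best_solution(coresets, weights_list, k, p, d, static_kp_solver):
    """Run the static (k, p)-clustering solver on every weighted coreset R_i
    and keep the minimum-cost solution, as in the selection step described
    above."""
    best = None
    for R, w in zip(coresets, weights_list):
        S = static_kp_solver(R, w, k)     # constant-approximate static algorithm A
        c = weighted_cost(R, w, S, p, d)
        if best is None or c < best[0]:
            best = (c, S)
    return best
```

With a toy solver that, say, always returns the first coreset point, this simply picks whichever candidate coreset yields the cheapest clustering; in the actual algorithm, A would be a quadratic-time local-search routine.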
The next lemma, together with Invariant 3.3, shows that S^t can be used as a solution for the entire set X^t.\n\nLemma 3.4. Let OPT(R^t_i) be the optimal solution in the weighted set R^t_i. Then it holds that Cp(X^t, OPT(R^t_i)) ≤ 2^{3p−1}(Cp(X^t, R^t_i) + Cp(X^t, OPT^t)).\n\nThis proves the second part of Theorem 3.1: for i such that Cp(X^t, R^t_i) = O(1) · Cp(X^t, OPT^t), the solution computed on the set R^t_i is a good approximation of the optimal solution, and therefore the algorithm maintains a constant factor approximation. The bound on the running time being similar to the one of Section 2, we provide it in the supplementary material.\n\n4 Empirical Analysis\n\nIn this section, we evaluate our algorithm for facility location experimentally. Recall that we aim to strike a balance between (1) overall running time, (2) the cost of the solution, and (3) the total recourse. Our implementation follows the framework outlined in Theorem 2.1. As part of the recomputation step between two periods, we run 5 independent executions of Meyerson's algorithm and select the execution with lowest cost. The updates within a period are handled by assigning an inserted point to its closest center if the distance is less than f, and otherwise opening a new center at the point; we simply remove a client if it gets deleted.
We compare our algorithm against two variants of Meyerson's algorithm.

Algorithm 1 Fully Dynamic (k, p)-clustering
Input: For all t, X^t is the set of points at time t
Output: For all t, a set S^t of centers at time t and a mapping φ^t : X^t → S^t of points to S^t
1: Let R^t_i be the coreset at time t for L_i, and φ^t_i the function mapping every point of X^t to its closest point in R^t_i
2: Let t_0 be the last time MeyersonCapped was called
3: for all time t do
4:   if t is a multiple of k then {Recompute from scratch all R_i}
5:     t_0 ← t
6:     for all i ← 1, ..., log n do
7:       R^t_i ← MeyersonCapped(L_i, X^t)
8:     end for
9:   else if a point x^t is inserted then
10:    R^t_i ← R^{t-1}_i ∪ {x^t}, ∀i ∈ [log n]
11:  else if a point x^t is removed then
12:    for all i ← 1, ..., log n do
13:      if x^t ∉ R^{t-1}_i then
14:        R^t_i ← R^{t-1}_i
15:      else
16:        Let z ← φ^{t_0}(x^t)
17:        Let z′ ∈ X^t be the closest to z (breaking ties arbitrarily)
18:        R^t_i ← (R^{t-1}_i \ {x^t}) ∪ {z′}
19:        Call DeletePoint(L_i, t^i_l, R^{t-1}_i)  {t^i_l is the last time MeyersonCapped was invoked on R^{t-1}_i}
20:      end if
21:    end for
22:  end if
23:  {Keep the best among all R^t_i and assignments φ^t}
24:  (i, S^t) ← arg min over i, with C_i = A(R^t_i), of C_p(C_i, R^t_i)
25:  Let ψ : R^t_i → C_i be the assignment computed by A
26:  φ^t ← ψ ∘ φ^t_i
27: end for

The first variant, termed MeyersonRec, re-runs Meyerson at every single update. The second, termed MeyersonSingle, consists of a single execution of Meyerson over all updates, where deletions are handled by simply removing the distance cost of the deleted point. Following Hubert-Chan et al. [24], we incorporate deletions by considering a sliding window over the data set.
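The sliding-window insert/delete stream used in the experiments can be generated as follows. This is an illustrative sketch, not the actual benchmarking harness; the generator name is ours.

```python
from collections import deque

def sliding_window_stream(points, w):
    """Turn a static data set into an insert/delete stream via a sliding
    window of size w: each new point is inserted, and once the window is
    full, the oldest point is deleted as the window slides forward."""
    window = deque()
    for x in points:
        yield ("insert", x)
        window.append(x)
        if len(window) > w:
            yield ("delete", window.popleft())
```

For example, with a window of size 2 over the points 1, 2, 3, 4, the stream interleaves the insertion of each new point with the deletion of the point leaving the window.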
A point is inserted/deleted when it enters/exits the window, respectively.

Data Set and Setup. We consider the following data sets, equipped with the Euclidean distance.
• The Twitter data set [23], considered by [24], consists of 4.5 million geotagged tweets in 3 dimensions (longitude, latitude, timestamp). We restricted our experiments to the first 200K tweets.
• The COVERTYPE data set [5], considered by [30], from the UCI repository, with 581K points and 54 dimensions. We restricted our experiments to the first 100K points and 10 dimensions (the ones we believed to be appropriate for a Euclidean metric).
• The USCensus1990 data set [32] from the UCI repository has 69 dimensions and 2.5 million points. We restricted our experiments to the first 30K points.
We restricted the number of points considered due to time constraints. Since larger data sets typically have more complicated ground truths, we used larger windows containing more samples for them. To avoid overfitting, we also adjusted the cost of opening a facility depending on the window size, i.e., for larger windows a lower opening cost per facility. For COVERTYPE and USCensus1990, we used a window size of 5000 points and a facility cost of 0.5; for Twitter, the window size was 10000 and the facility cost 0.004. All our code is written in Python. The experiments were executed on a Windows 10 machine with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz (6 cores, 12 logical processors) and 16 GB RAM.

Results. In all three data sets we generally observed the same behavior in terms of running time, cost, and the number of clusters opened; see Figure 2. Our algorithm is 100 times faster than MeyersonRec. Compared to MeyersonSingle, our algorithm is slower initially.
When the number of processed points becomes very large, the running time of MeyersonSingle deteriorates comparatively, as it never removes a facility once it has been opened: the time to compute the distance to the set of facilities therefore keeps increasing (see Figure 1 in the supplementary material). The cost of MeyersonSingle generally has a linear dependency on the number of updates, though the slope is very gentle. This is what our algorithm takes advantage of: broadly speaking, it approximates the curve with a step function (adapted to handle insertions and deletions). The cost of our algorithm and MeyersonRec is essentially indistinguishable, and in certain cases our algorithm fares even slightly better. The recourse of our algorithm is, as expected, much better than that of MeyersonRec, and significantly worse than that of MeyersonSingle.

Figure 2: A comparison of the algorithms we consider in terms of running time (left column), cost of the solution (middle column), and recourse (right column), on Twitter (top row), covertype (middle row), and USCensus1990 (bottom row).

Finally, we ran our algorithm with multiple choices of the facility cost f, and we observed that the recourse is almost independent of both the cost and the running time of the algorithm, and depends only on the number of updates. This is consistent with tracking data that evolves in time, where the underlying ground-truth clustering also evolves.

Acknowledgements. Nikos Parotsidis is supported by Grant Number 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation.
This project has benefited from state aid managed by the Agence Nationale de la Recherche under the FOCAL program, reference ANR-18-CE40-0004-01.

References

[1] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 61–72, Oct 2017.

[2] A. Anagnostopoulos, R. Bent, E. Upfal, and P. V. Hentenryck. A simple and deterministic competitive algorithm for online facility location. Inf. Comput., 194(2):175–202, 2004.

[3] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004.

[4] O. Bachem, M. Lucic, and A. Krause. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 292–300, 2017.

[5] J. A. Blackard, D. J. Dean, and C. W. Anderson. Covertype data set, https://archive.ics.uci.edu/ml/datasets/covertype.

[6] V. Braverman, G. Frahling, H. Lang, C. Sohler, and L. F. Yang. Clustering high dimensional dynamic data streams. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 576–585, 2017.

[7] V. Braverman, H. Lang, K. Levin, and M. Monemizadeh. Clustering problems on sliding windows. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1374–1390, 2016.

[8] V. Braverman, A. Meyerson, R. Ostrovsky, A. Roytman, M. Shindler, and B. Tagiku. Streaming k-means on well-clusterable data. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 26–40, 2011.

[9] M. Charikar, C. Chekuri, T. Feder, and R.
Motwani. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417–1440, 2004.

[10] M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 378–388, 1999.

[11] M. Charikar, L. O'Callaghan, and R. Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing (STOC), pages 30–39, 2003.

[12] M. Cygan, A. Czumaj, M. Mucha, and P. Sankowski. Online facility location with deletions. In 26th Annual European Symposium on Algorithms (ESA), pages 21:1–21:15, 2018.

[13] A. Czumaj, C. Lammersen, M. Monemizadeh, and C. Sohler. (1+ε)-approximation for facility location in data streams. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1710–1728, 2013.

[14] D. Fotakis. Incremental algorithms for facility location and k-median. Theor. Comput. Sci., 361(2-3):275–313, 2006.

[15] D. Fotakis. A primal-dual algorithm for online non-uniform facility location. J. Discrete Algorithms, 5(1):141–148, 2007.

[16] D. Fotakis. On the competitive ratio for online facility location. Algorithmica, 50(1):1–57, 2008.

[17] G. Frahling and C. Sohler. Coresets in dynamic geometric data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 209–217, 2005.

[18] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.

[19] G. Goranci, M. Henzinger, and D. Leniowski. A tree structure for dynamic facility location. In 26th Annual European Symposium on Algorithms (ESA), pages 39:1–39:13, 2018.

[20] S. Guha and S. Khuller. Greedy strikes back: Improved facility location algorithms.
J. Algorithms, 31(1):228–248, 1999.

[21] A. Gupta and K. Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.

[22] M. Henzinger, D. Leniowski, and C. Mathieu. Dynamic clustering to minimize the sum of radii. In 25th Annual European Symposium on Algorithms (ESA), pages 48:1–48:10, 2017.

[23] T. Hubert Chan, A. Guerqin, and M. Sozio. Twitter data set, https://github.com/fe6Bc5R4JvLkFkSeExHM/k-center.

[24] T. Hubert Chan, A. Guerqin, and M. Sozio. Fully dynamic k-center clustering. In Proceedings of the 2018 World Wide Web Conference on World Wide Web (WWW), pages 579–587, 2018.

[25] P. Indyk. Algorithms for dynamic geometric problems over data streams. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pages 373–380, 2004.

[26] K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM, 48(2):274–296, 2001.

[27] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.

[28] C. Lammersen and C. Sohler. Facility location in dynamic geometric data streams. In Proceedings of the 16th Annual European Symposium on Algorithms (ESA), pages 660–671, 2008.

[29] H. Lang. Online facility location against a t-bounded adversary. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1002–1014, 2018.

[30] S. Lattanzi and S. Vassilvitskii. Consistent k-clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1975–1984, 2017.

[31] S. Li. A 1.488 approximation algorithm for the uncapacitated facility location problem. Information and Computation, 222:45–58, 2013.

[32] C.
Meek, B. Thiesson, and D. Heckerman. US Census data (1990), http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990).

[33] R. R. Mettu and C. G. Plaxton. The online median problem. SIAM J. Comput., 32(3):816–832, 2003.

[34] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56(1):35–60, Jul 2004.

[35] A. Meyerson. Online facility location. In 42nd Annual Symposium on Foundations of Computer Science (FOCS), pages 426–431, 2001.

[36] A. Munteanu and C. Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. KI, 32(1):37–53, 2018.

[37] M. Shindler, A. Wong, and A. Meyerson. Fast and accurate k-means for large datasets. In Proceedings of the Twenty-fifth Conference on Neural Information Processing Systems (NIPS), pages 2375–2383, 2011.