Coreset for Line-Sets Clustering

Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track

Bibtex Paper Supplemental

Authors

Sagi Lotan, Ernesto Evgeniy Sanches Shayda, Dan Feldman

Abstract

The input to the {line-sets $k$-median} problem is an integer $k \geq 1$, and a set $\mathcal{L} = \{L_1,\dots,L_n\}$that contains $n$ sets of lines in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points in $\mathbb{R}^d$) that minimizes the sum $\sum_{L \in \mathcal{L}}\min_{\ell\in L, c\in C}\mathrm{dist}(\ell,c)$ of Euclidean distances from each set to its closest center, where $\mathrm{dist}(\ell,c):=\min_{x\in \ell}\norm{x-c}_2$.An \emph{$\varepsilon$-coreset} for this problem is a weighted subset of sets in $\mathcal{L}$ that approximates this sum up to $1 \pm \varepsilon$ multiplicative factor, for every set $C$ of $k$ centers. We prove that \emph{every} such input set $\set{L}$ has a small $\varepsilon$-coreset, and provide the first coreset construction for this problem and its variants. The coreset consists of $O(\log^2n)$ weighted line-sets from $\set{L}$, and is constructed in $O(n\log n)$ time for every fixed $d, k\geq 1$ and $\varepsilon \in (0,1)$. The main technique is based on a novel reduction to a ``fair clustering'' of colored points to colored centers. We then provide a coreset for this coloring problem, which may be of independent interest. Open source code and experiments are also provided.