Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track
Yichuan Deng, Zhao Song, Omri Weinstein, Ruizhe Zhang
In the \emph{Distance Oracle} problem, the goal is to preprocess $n$ vectors $x_1, x_2, \cdots, x_n$ in a $d$-dimensional normed space $(\mathbb{X}^d, \|\cdot\|_l)$ into a cheap data structure, so that given a query vector $q \in \mathbb{X}^d$, all distances $\|q - x_i\|_l$ to the data points $\{x_i\}_{i \in [n]}$ can be quickly approximated (faster than the trivial $\sim nd$ query time). This primitive is a basic subroutine in machine learning, data mining, and similarity-search applications. In the case of $\ell_p$ norms, the problem is well understood, and optimal data structures are known for most values of $p$. Our main contribution is a fast $(1 \pm \varepsilon)$ distance oracle for \emph{any symmetric} norm $\|\cdot\|_l$. This class includes $\ell_p$ norms and Orlicz norms as special cases, as well as other norms used in practice, e.g., top-$k$ norms, max-mixtures and sum-mixtures of $\ell_p$ norms, small-support norms, and the box-norm. We propose a novel data structure with $\tilde{O}(n (d + \mathrm{mmc}(l)^2))$ preprocessing time and space, and $t_q = \tilde{O}(d + n \cdot \mathrm{mmc}(l)^2)$ query time, where $\mathrm{mmc}(l)$ is a complexity measure (modulus) of the symmetric norm under consideration. When $l = \ell_p$, this runtime matches the aforementioned state-of-the-art oracles.
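To make the oracle interface concrete, the following minimal Python sketch illustrates the special case $l = \ell_2$ using a Johnson-Lindenstrauss random projection. The class name, the constant in the sketch dimension, and the projection choice are illustrative assumptions; this is not the paper's construction for general symmetric norms, which relies on $\mathrm{mmc}(l)$-based sketching.

```python
import numpy as np


class L2DistanceOracle:
    """Illustrative (1 +/- eps) distance oracle for the Euclidean norm
    via a Johnson-Lindenstrauss random projection.

    NOTE: this only demonstrates the preprocess/query interface for
    l = ell_2; the paper's data structure for general symmetric norms
    is substantially different.
    """

    def __init__(self, X, eps=0.1, seed=0):
        # X: (n, d) array holding the data points x_1, ..., x_n.
        n, d = X.shape
        rng = np.random.default_rng(seed)
        # Sketch dimension k = O(eps^{-2} log n); the constant 8 is heuristic.
        k = max(1, int(np.ceil(8 * np.log(max(n, 2)) / eps**2)))
        # Gaussian sketch matrix, scaled so that ||Sx||_2 ~ ||x||_2.
        self.S = rng.standard_normal((k, d)) / np.sqrt(k)
        # Preprocessing: store sketched points, O(n(d + k)) time and space.
        self.SX = X @ self.S.T  # shape (n, k)

    def query(self, q):
        # Estimate ||q - x_i||_2 for all i in O(dk + nk) time,
        # instead of the trivial O(nd).
        Sq = self.S @ q  # sketch the query, O(dk)
        return np.linalg.norm(self.SX - Sq, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 256))
    q = rng.standard_normal(256)
    oracle = L2DistanceOracle(X, eps=0.2)
    est = oracle.query(q)
    exact = np.linalg.norm(X - q, axis=1)
    print("max relative error:", np.max(np.abs(est - exact) / exact))
```

By linearity of the sketch, $S(q - x_i) = Sq - Sx_i$, so a single sketched query suffices to approximate all $n$ distances at once, mirroring the $t_q = \tilde{O}(d + n \cdot \mathrm{poly})$ shape of the query time above.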