{"title": "k-Means Clustering of Lines for Big Data", "book": "Advances in Neural Information Processing Systems", "page_first": 12817, "page_last": 12826, "abstract": "The input to the \\emph{$k$-mean for lines} problem is a set $L$ of $n$ lines in $\\mathbb{R}^d$, and the goal is to compute\na set of $k$ centers (points) in $\\mathbb{R}^d$ that minimizes the sum of squared distances over every line in $L$ and its nearest center. This is a straightforward generalization of the $k$-mean problem where the input is a set of $n$ points instead of lines.\n\nWe suggest the first PTAS that computes a $(1+\\epsilon)$-approximation to this problem in time $O(n \\log n)$ for any constant approximation error $\\epsilon \\in (0, 1)$, and constant integers $k, d \\geq 1$. This is by proving that there is always a weighted subset (called coreset) of $dk^{O(k)}\\log (n)/\\epsilon^2$ lines in $L$ that approximates the sum of squared distances from $L$ to \\emph{any} given set of $k$ points. \n\nUsing traditional merge-and-reduce technique, this coreset implies results for a streaming set (possibly infinite) of lines to $M$ machines in one pass (e.g. cloud) using memory, update time and communication that is near-logarithmic in $n$, as well as deletion of any line but using linear space. 
These results generalize to other distance functions such as $k$-median (sum of distances), or to ignoring the farthest $m$ lines from the given centers in order to handle outliers.\n\nExperimental results on 10 machines on Amazon EC2 cloud show that the algorithm performs well in practice.\nOpen source code for all the algorithms and experiments is also provided.", "full_text": "k-Means Clustering of Lines for Big Data\n\nYair Marom\nDepartment of Computer Science\nUniversity of Haifa\nHaifa, Israel\nyairmrm@gmail.com\n\nDan Feldman\nDepartment of Computer Science\nUniversity of Haifa\nHaifa, Israel\ndannyf.post@gmail.com\n\nAbstract\n\nThe input to the k-mean for lines problem is a set L of n lines in Rd, and the goal\nis to compute a set of k centers (points) in Rd that minimizes the sum of squared\ndistances over every line in L and its nearest center. This is a straightforward\ngeneralization of the k-mean problem where the input is a set of n points instead\nof lines.\nWe suggest the first PTAS that computes a (1 + \u03b5)-approximation to this problem\nin time O(n log n) for any constant approximation error \u03b5 \u2208 (0, 1), and constant\nintegers k, d \u2265 1. This is by proving that there is always a weighted subset (called\ncoreset) of dkO(k) log(n)/\u03b52 lines in L that approximates the sum of squared\ndistances from L to any given set of k points.\nUsing the traditional merge-and-reduce technique, this coreset implies results for a\nstreaming set (possibly infinite) of lines to M machines in one pass (e.g. cloud)\nusing memory, update time and communication that is near-logarithmic in n, as\nwell as deletion of any line but using linear space. These results generalize to\nother distance functions such as k-median (sum of distances), or to ignoring the farthest\nm lines from the given centers in order to handle outliers.\nExperimental results on 10 machines on Amazon EC2 cloud show that the\nalgorithm performs well in practice. 
Open source code for all the algorithms and\nexperiments is also provided.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\n1.1 Background\n\nClustering is the task of partitioning the input set into subsets, where items in the same subset (cluster)\nare similar to each other, compared to items in other clusters. There are many different clustering\ntechniques, but arguably the most common in both industry and academia is the k-mean problem,\nwhere the input is a set P of n points in Rd, and the goal is to compute a set C of k centers (points) in\nRd that minimizes the sum of squared distances from each point p \u2208 P to its nearest center in C, i.e.\n\nC \u2208 arg min_{C\u2032 \u2286 Rd, |C\u2032| = k} \u2211_{p \u2208 P} min_{c\u2032 \u2208 C\u2032} \u2016p \u2212 c\u2032\u2016^2.\n\nA very common heuristic for this problem is Lloyd\u2019s algorithm [3, 22], which is similar to the\nEM algorithm described in Section 5 of the supplementary material [2]. We consider a natural\ngeneralization of this k-mean problem, where the input set P of n points is replaced by a set L of n\nlines in Rd; see Fig. 2. Here, the distance from a line to a center c is the smallest Euclidean distance\nto c over all the points on the line. Since we only assume the \u201cweak\u201d triangle inequality between\npoints, our solution can easily be generalized to the sum of distances to the power of any constant z \u2265 1,\nas explained e.g. 
in [7, 6].\nMotivation for solving the k-line mean problem arises in many different fields, when there is some\nmissing entry in all or some of the input vectors, or incomplete information such as a missing sensor.\nFor example, a common problem in Computer Vision is to compute the position of a point or k points\nin the world, based on their projections on a set of n 2D-images, which turn into n lines via the\npinhole camera model; see Fig. 1, and [18, 27] for surveys.\nIn Data Science and matrix approximation theory, every missing entry turns a point (a database\nrecord) into a line by considering all the possible values for the missing entry. E.g., n points on the\nplane from k-mean clusters would turn into n horizontal/vertical lines that intersect \"around\" the\nk-mean centers. The resulting problem is then k-mean for n lines [24, 25]. One can also consider an\napplication to semi-supervised learning: k-mean for mixed points and lines. This problem arises\nwhen lines are unlabeled points (last axis is a label) and we want to add a label to the farthest lines\nfrom the points.\n\nFigure 1: Application of k-line mean for computer vision. Given a drone (or any other rigid body) that\nis captured by n cameras, our goal is to locate the 3-dimensional position of the drone in space by\nidentifying k = 4 fixed known markers/features on the drone. Each point on each image corresponds\nto a line that passes through it and the pinhole of the camera. Without noise, all the lines intersect at\nthe same point (marker in R3). Otherwise their 1-mean is a natural approximation.\n\n1.2 Related Work\n\nThe k-mean problem and its variants have been studied in numerous papers over recent decades,\nespecially in the machine learning community; see [16, 20, 28, 5] and references therein. 
There\nare also many results regarding projective clustering, where the k centers are lines or\nj-dimensional subspaces instead of points.\nHowever, significantly fewer results are known for the case of clustering subspaces, or even lines. A\npossible reason might be the fact that the triangle inequality or its weaker version holds for a\npair of points but not for lines, even in the planar case: two parallel lines can have an arbitrarily\nlarge distance from each other, but still intersect with a third line simultaneously. Gao, Langberg\nand Schulman [12] used Helly\u2019s theorem [8] (intersection of convex sets) to introduce the \"k-center\nproblem\" for lines, that aims to cover a collection of lines by the smallest k balls in R3.\nGao, Langberg and Schulman [11] addressed the 1-center problem for convex sets, which aims to compute a\nball that intersects a given set of n convex sets such as lines and \u2206-affine subspaces. This type of\nnon-clustering problem (k = 1) is easier since it is a convex optimization problem rather than a\nnon-convex clustering problem.\nUnlike the case of numerous theoretical papers that study the k-mean problem for points, we did not\nfind any provable solution for the k-line mean problem, or even an efficient PTAS (polynomial-time\napproximation scheme). There are very few related results, which we give here. A solution for the\nspecial case of d = 2 and sum of distances was considered in [23]. Lee and Schulman [17] studied\nthe generalization of the k-center problem (maximum over the n distances, instead of their sum), for\nthe case where the input is a set of n affine subspaces, each of dimension \u2206. In this case, the size of\n\nFigure 2: Problem Statement demonstration for the planar case. The input is a set L = {\u21131, . . . 
, \u21136}\nof n = 6 lines in R2, and our goal is to find the k = 2 points (centers) p1, p2 \u2208 R2 that minimize the\nsum of Euclidean distances from each line in L to its nearest center. See Section 2.1 for the definition of the\ndistance function D.\n\nthe coreset is exponential in d, which was proved to be unavoidable even for a coreset of 1-center\n(single point) for this type of covering problems. The corresponding covering problems can then be\nsolved using traditional computational geometry techniques.\nTable 1 summarizes the above results.\n\nProblem | Running Time | Approx. Factor | Paper\n1-line center in Rd | nd(1/\u03b5)^O(1) | 1 + \u03b5 | [11]\n2-line centers in R2 | O(n log(1/\u03b5)(d + log n)) | 2 + \u03b5 | [12]\n3-line centers in R2 | O(nd log(1/\u03b5) + n log2(n) log(1/\u03b5)/\u03b5) | 2 + \u03b5 | [12]\n1-center for convex-sets in Rd | O(n^(\u2206+1) d(1/\u03b5)^O(1)) | 1 + \u03b5 | [11]\nk-line median in R3 | iterative, unbounded | unbounded | [21]\nk-line median in R2 | n(log n/\u03b5)^O(k) | 1 + \u03b5 | [23]\n1-\u2206-flats center in Rd | O((nd\u2206/\u03b5^2) log(\u2206/\u03b5)) | O(\u2206^(1/4)) | [17]\n2-\u2206-flats centers in Rd | O(dn^2 log n) | 1 + \u03b5 | [17]\n3-\u2206-flats centers in Rd | 2^O(\u2206(1+1/\u03b5^2)) nd | 1 + \u03b5 | [17]\nk-\u2206-flats centers in Rd | 2^O(\u2206k log k(1+1/\u03b5^2)) nd | 1 + \u03b5 | [17]\nk-line median in Rd | O(d^3 n log(n)k log k + (d/\u03b5)^2) + ndk^O(k) | 1 + \u03b5 | Our\n\nTable 1: Summary of related results for k centers of n points in Rd. 
The dimension of the queried\nsubspaces is denoted by \u2206 and the error rate by \u03b5.\n\n1.3 Main Contribution\nOur main technical result is an algorithm that gets a set L of n lines in Rd and an integer k \u2265 1,\nand computes an \u03b5-coreset (see Definition 2.7) of size dkO(k) log(n)/\u03b52 for L and every error\nparameter \u03b5 > 0, in running time that is near-linear in the number n of data lines and polynomial in the\ndimensionality d and the number k of desired centers; see Theorem 2.8 for details and exact bounds.\nUsing this coreset with a merge-and-reduce technique, we achieve the following results:\n\n1. An algorithm that, during one pass, maintains and outputs a (1 + \u03b5)-approximation to the\nk-line mean of the lines seen so far; see Definition 2.4.\n\n2. A streaming algorithm that computes a (1 + \u03b5)-approximation for the k-line mean of a set L\nof lines that may be distributed (partitioned) among M machines, where each machine needs\nto send only d3kO(k) log2 n input lines to the main server at the end of its computation.\n\n3. Experimental results on 10 machines on Amazon EC2 Cloud [9] show that the algorithm\nperforms well in practice and boosts the performance of the existing EM heuristic [19].\n\nWhy is the coreset exponential in k? The worst-case size of our coreset is polynomial in log n but\nexponential in k. Still, it is tight: such lower bounds are known for weighted centers [14], which is a\nspecial case of coreset for lines, as explained in [10]. Fortunately, it seems that it is only a theoretical\nworst-case bound for a very tailored and artificial synthetic example. Our experiments suggest that\nthe error on real-world data sets is far smaller than our theorem predicts, even when using the same importance\ndistribution with a smaller sample size. 
In addition, we can partition the n-line set in the hope that\neach subset would be served by at most k points.\n\n2 Problem Statement and Theoretical Result\n\n2.1 Preliminaries and Problem Statement\nFor an integer n \u2265 1 we define [n] = {1, . . . , n}. From here and in the rest of the paper, we\nassume that we are given a function D : Rd \u00d7 Rd \u2192 R and a constant \u03c1 > 0 such that D(a, b) \u2264\n\u03c1(D(a, c) + D(c, b)) for every a, b, c \u2208 Rd.\nDefinition 2.1 (weighted set) A weighted set of lines is a pair L\u2032 = (L, w) where L is a set of lines\nin Rd, and w : L \u2192 (0,\u221e) is a function that maps every \u2113 \u2208 L to w(\u2113) > 0, called the weight of \u2113.\nA weighted set (L, 1) where 1 is the weight function w : L \u2192 {1} that assigns w(\u2113) = 1 for every\n\u2113 \u2208 L may be denoted by L for short.\nDefinition 2.2 (distance) The Euclidean distance between a pair of points is denoted by the function\nD : Rd \u00d7 Rd \u2192 [0,\u221e), s.t. for every x, y \u2208 Rd we have D(x, y) = \u2016x \u2212 y\u2016_2. For every set\nX \u2286 Rd and a point x \u2208 Rd, we define the distance from X to x by D(X, x) = inf_{q \u2208 X} D(q, x), and\nfor every set Y \u2286 Rd, we denote the distance from X to Y by D(X, Y) = inf_{(x,y) \u2208 X\u00d7Y} D(x, y).\nDefinition 2.3 (cost) For every set P \u2286 Rd of k points and a weighted set of lines L\u2032 = (L, w) in\nRd, we denote the sum of weighted distances from L to P by cost(L\u2032, P) = \u2211_{\u2113 \u2208 L} w(\u2113) D(\u2113, P).\n\nA natural generalization of the k-mean problem is to replace the input set of points P by a set L of n\nlines in Rd.\nDefinition 2.4 (k-mean for lines) Let L\u2032 = (L, w) be a weighted set of lines in Rd and k \u2265 1 be\nan integer. 
A set P\u2217 \u2286 Rd is a k-mean of L\u2032 if it minimizes cost(L\u2032, P) over every set P of k points\nin Rd.\n\n2.2 Theoretical Results\nWe begin with our first result: the first bicriteria approximation for the general k-mean of lines in Rd,\nfor any integers k, d \u2265 1.\nDefinition 2.5 ((\u03b1, \u03b2)-approximation) Let L be a finite set of lines in Rd, k \u2265 1 be an integer and\nP\u2217 \u2286 Rd be a k-mean of L; see Definition 2.4. Then, for every \u03b1, \u03b2 \u2265 0, a set B \u2286 Rd of k\u03b2 points\nis called an (\u03b1, \u03b2)-approximation of L if\n\ncost(L, B) \u2264 \u03b1 \u00b7 cost(L, P\u2217).\n\nIf \u03b2 = 1 then B is called an \u03b1-approximation of L. Moreover, if \u03b1 = \u03b2 = 1 then B is called an optimal\nsolution.\nTheorem 2.6 Let L be a set of n lines in Rd, k \u2265 1 be an integer, \u03b4 \u2208 (0, 1) and\n\nm \u2265 c (dk log2 k + log2(1/\u03b4)),\n\nfor a sufficiently large constant c > 1 that can be determined from the proof. Let B be the output set\nof a call to BI-CRITERIA-APPROXIMATION(L, m); see Algorithm 2. Then,\n\n|B| \u2208 O(log n (dk log k + log(1/\u03b4)))    (1)\n\nand with probability at least 1 \u2212 \u03b4,\n\ncost(L, B) \u2264 4\u03c1^2 \u00b7 min_{P \u2286 Rd, |P| = k} cost(L, P).\n\nMoreover, B can be computed in O(nd2k log k + m2 log n) time.\n\nA coreset is a problem-dependent data summarization. The definition of a coreset is not consistent among\npapers. In this paper, the input is usually a set of lines in Rd, but for the streaming case we compute\na coreset for a union of (weighted) coresets, and thus weights will be needed. 
We use the following\ndefinition of Feldman and Kfir [15].\nDefinition 2.7 (\u03b5-coreset [15]) For an approximation error \u03b5 > 0, the weighted set S\u2032 = (S, u) is\ncalled an \u03b5-coreset for the query space (P\u2032, Y, f, loss), if S \u2286 P, u : S \u2192 [0,\u221e), and for every\ny \u2208 Y we have\n\n(1 \u2212 \u03b5) f_loss(P\u2032, y) \u2264 f_loss(S\u2032, y) \u2264 (1 + \u03b5) f_loss(P\u2032, y).\n\nTheorem 2.8 (coreset for k-line mean) Let L be a set of n lines in Rd, k \u2265 1 be an integer, \u03b5, \u03b4 \u2208\n(0, 1) and m > 1 be an integer such that\n\nm \u2265 (c d^2 k log2(k) log2(n) + log(1/\u03b4)) / \u03b5^2,\n\nfor some universal constant c > 0, and Qk = {B \u2286 Rd | |B| = k}. Let (S, u) be the output of\na call to CORESET(L, k, m); see Algorithm 4. Then, with probability at least 1 \u2212 \u03b4, (S, u) is an\n\u03b5-coreset for the query space (L, Qk, D, \u2016\u00b7\u2016_1). Moreover, (S, u) can be computed in time\n\nO(d^3 n log(n) k log k + (d/\u03b5)^2) + ndk^O(k).\n\nUsing the streaming coreset framework of Feldman and Kfir [15], we boost performance:\ninstead of computing the coreset on the whole data set at once, we compute\nit on batches that are read one after the other (the streaming version). Its validity can be seen in\nthe experiments in Section 4.\n\n3 Algorithms and Technique\n\nWe present here the first bi-criteria solution for the k-line mean problem, where Alg. 1 is a\nsub-procedure that is called during the execution of Alg. 
2.\n\nAlgorithm 1: CENTROID-SET(L)\n\nInput: A finite set L of n lines in Rd.\nOutput: A set G \u2286 Rd of O(n2) points.\n1 for every \u2113 \u2208 L do\n2     for every \u2113\u2032 \u2208 L \\ {\u2113} do\n3         Compute q(\u2113, \u2113\u2032) \u2208 arg min_{x \u2208 \u2113} D(\u2113\u2032, x) // the closest point on \u2113 to \u2113\u2032. Ties broken arbitrarily.\n4     Q(\u2113) := {q(\u2113, \u2113\u2032) | \u2113\u2032 \u2208 L \\ {\u2113}}\n5 G := \u22c3_{\u2113 \u2208 L} Q(\u2113)\n6 return G\n\nOverview of Algorithm 2. The input to the algorithm is a set L consisting of n lines in Rd and a\npositive integer m \u2265 1. In each iteration, the algorithm picks a small uniform sample S of the\nremaining lines in Line 3, computes its centroid set G using a call to Algorithm 1 in Line 4, and adds G to\nthe output set B in Line 5. Then, in Lines 6\u20137, a constant fraction of the lines that are closest to G is removed\nfrom the set X of remaining lines. The algorithm then continues to the next iteration, but only on the\nremaining set of lines, until almost no lines remain. The output is the resulting set B.\n\nAlgorithm 2: BI-CRITERIA-APPROXIMATION(L, m)\n\nInput: A set L of n lines in Rd, and an integer m \u2265 1.\nOutput: A set B \u2286 Rd which is, with probability at least 1/2, an (\u03b1, \u03b2)-approximation for the k-mean of L, where \u03b1 \u2208 O(1) and \u03b2 = O(m2 log n). // See Definition 2.5 and Theorem 2.6.\n1 X := L, B := \u2205\n2 while |X| > 100 do\n3     Pick a sample S of |S| \u2265 m lines, where each line \u2113 \u2208 S is sampled i.i.d. and uniformly at random from X.\n4     G := CENTROID-SET(S)\n5     B := B \u222a G\n6     X\u2032 := the closest 7|X|/11 lines in X to G. Ties broken arbitrarily.\n7     X := X \\ X\u2032\n8 return B\n\nOverview of Algorithm 3. 
The input is a set L of lines in Rd, a point b \u2208 Rd and an integer\nk \u2265 1 for the number of desired centers. This procedure is called from Algorithm 4, where b is an\napproximation to the 1-mean of L. The output function s maps every line \u2113 \u2208 L to [0,\u221e), and is\nused during the execution of Algorithm 4. We define in Line 1 a unit sphere Sd\u22121 that is centered\nat b. Next, in Line 3, for each line \u2113 \u2208 L we define \u2113\u2032 \u2286 Rd to be the translation of \u2113 to b. In Lines\n4\u20135, we replace every line \u2113\u2032 with one of its two intersections with Sd\u22121, and define the union of these\npoints to be the set Q. In Line 6 we call the sub-procedure WEIGHTED-CENTERS-SENSITIVITY\nthat is described in [10]. This procedure returns the sensitivities of the query space of k-weighted\ncenters queries on Q. As stated in Lemma 32 in the supplementary material, the total sensitivity of this\ncoreset is kO(k) log n. Finally, in Line 7, we convert the output sensitivity u(p) of each point p in Q\nto the output sensitivity s(\u2113) of the corresponding line \u2113 in L.\n\nAlgorithm 3: LINES-SENSITIVITY(L, b, k)\n\nInput: A set L of n lines in Rd, a point b \u2208 Rd and an integer k \u2265 1.\nOutput: A (sensitivity) function s : L \u2192 [0,\u221e).\n1 Sd\u22121 := {x \u2208 Rd | \u2016x \u2212 b\u2016 = 1} // the unit sphere that is centered at b.\n2 for \u2113 \u2208 L do\n3     \u2113\u2032 := the line {x \u2212 b | x \u2208 \u2113} that is parallel to \u2113 and intersects b // see Fig. 3.\n4     p(\u2113\u2032) := an arbitrary point in the pair \u2113\u2032 \u2229 Sd\u22121\n5 Q := {p(\u2113\u2032) | \u2113 \u2208 L}\n6 u := WEIGHTED-CENTERS-SENSITIVITY(Q, 2k) // see algorithm overview.\n7 Set s : L \u2192 [0,\u221e) such that for every \u2113 \u2208 L, s(\u2113) := u(p(\u2113\u2032)).\n8 return s\n\nOverview of Algorithm 4. 
The algorithm gets a set L of lines in Rd, an integer k \u2265 1 for the number\nof desired centers and a positive integer m \u2265 1 for the coreset size, and returns an \u03b5-coreset for L;\nsee Definition 2.7. In Line 2 a small set B that approximates the k-mean of L is computed via a call\nto BI-CRITERIA-APPROXIMATION. In Line 3 the lines are clustered according to their nearest point\nin B, and in Line 5 the sensitivities of the lines in the cluster Lb are computed for each center b \u2208 B. In the\nsecond \"for\" loop, in Lines 8\u201310, we set the sensitivity of each line to be the sum of the scaled distance\nof the line to its nearest center b (translation) and the sensitivity sb that measures its importance with\nrespect to its direction (rotation). Here, scaled distance means that the distance is divided by the sum\nof distances over all the lines in L. The \"if\" statement in Line 7 is used to avoid division by zero. In\nLine 12 we pick a random sample S from L, where the probability of choosing a line \u2113 is proportional\nto its sensitivity s(\u2113). In Line 13 we assign a weight to each line that is inversely proportional to the\nprobability of sampling it. The resulting weighted set (S, u) is returned in Line 14.\n\nFigure 3: Example of running Alg. 3 in the d = 2 dimensional case. On the left, we start with a set\nL = {\u21131, . . . , \u21135} of lines and a single center b. Next, we translate each line onto b and draw the\nunit sphere S around it. Finally, we replace each line with one of its two intersections with S and\nobtain the set Q = {p(\u21131), . . . 
, p(\u21135)}.\n\nAlgorithm 4: CORESET(L, k, m)\n\nInput: A finite set L of lines in Rd, number k \u2265 1 of centers and the coreset size m \u2265 1.\nOutput: A weighted set (\u201ccoreset\u201d) (S, u) that satisfies Theorem 2.8.\n1 j := cdk log2 k, where c > 0 is a sufficiently large constant that can be determined from the proof of Theorem 2.6.\n2 B := BI-CRITERIA-APPROXIMATION(L, j) // see Algorithm 2\n3 Compute a partition {Lb | b \u2208 B} of L such that Lb is the set (cluster) of lines that are closest to the point b \u2208 B. Ties broken arbitrarily.\n4 for every b \u2208 B do\n5     sb := LINES-SENSITIVITY(Lb, b, k) // the sensitivity of each line \u2113 \u2208 Lb that was translated onto b; see Algorithm 3\n6 for every b \u2208 B and \u2113 \u2208 Lb do\n7     if cost(L, B) > 0 then\n8         s(\u2113) := D(\u2113, b)/cost(L, B) + 2 \u00b7 sb(\u2113)\n9     else\n10         s(\u2113) := sb(\u2113)\n11     prob(\u2113) := s(\u2113) / \u2211_{\u2113\u2032 \u2208 L} s(\u2113\u2032)\n12 Pick a sample S of at least m lines from L, where each line \u2113 \u2208 L is sampled i.i.d. with probability prob(\u2113).\n13 Set u : S \u2192 [0,\u221e) such that for every \u2113 \u2208 S, u(\u2113) := 1/(|S| prob(\u2113)).\n14 return (S, u)\n\n4 Experimental Results\n\nMotivated by the goal of narrowing the gap between theory and practice, experiments played\na dominant role in this research.\nSoftware. We implemented our coreset construction from Algorithm 4 and its sub-procedures in\nPython 3.6. We use the MKL package [26] to improve performance, but it is not\nrequired in order to run the code. The source code can be found in [1].\nData Sets. 
We evaluate our system on two types of data sets: synthetic data generated with carefully\ncontrolled parameters, and real data: a road map from the \"Open Street Map\" dataset [13] and\n\"SimpleHome XCS7 1002 WHT Security Camera\" from the \"UCI Machine Learning Repository\"\ndataset [4]. The roads dataset [13] contains n = 10,000 roads in China from the \"Open Street Map\"\ndataset (Fig. 4, plot (a)); each road is represented as a 2-dimensional segment that was stretched into\nan infinite line on a plane. Synthetic data of n = 10,000 lines was generated as well (Fig. 4, plot (b)).\nExperiments on offline data analysis. In plot (a) of Fig. 4, when the sample size was m = 700\nlines out of 10,000 given lines, the coreset error and variance were 1.86 and 0.16, respectively, that\nis, an error of \u03b5 = 0.86 for a sample size of m = \u2308602/\u03b5\u2309 lines. On the other hand, the error and\nvariance of the competitor algorithm with the same sample size were 2.62 and 0.26. This implies that\nour coreset is more accurate and stable than RANSAC, and that our mathematically provable constant\napproximation algorithm for k-line mean works better than a standard EM algorithm also in practice.\nExperiments on streaming data analysis. Plot (c) of Fig. 4 demonstrates the size of the\nmerge-and-reduce streaming coreset tree during the streaming, which is logarithmic in the number of lines\nstreamed so far. In plot (d) of Fig. 4, we can see how the coreset construction running time\ndecreases linearly as the number of machines in the cluster increases (parameters are\nwritten in the chart\u2019s title), where coreset construction was measured 3 different times for 3 different\nnumbers of centers. 
Note that the decrease rate is almost, but not exactly, linear in the number of machines in the cluster,\ndue to the overhead of communication and I/O.\n\nFigure 4: Experiment Results. Graphs (a) and (b) show how the error decreases as the\nsample size increases, for coreset and uniform sampling. Graph (c) shows how the amount of data\nrequired by the merge-and-reduce coreset tree is logarithmic in the number of lines read so far in a\nstream. Graph (d) demonstrates how the coreset construction time decreases linearly in the number of\nmachines in the Amazon EC2 cluster, where coreset samples were taken for different numbers of centers,\npreserving the invariant.\n\n5 Conclusions and Future Work\n\nThis paper proposes an algorithm that computes, with high probability, an \u03b5-coreset of size near-logarithmic\nin the input size. Moreover, we suggest a streaming algorithm that computes a (1 + \u03b5)-approximation\nfor the k-means of a set of lines that is distributed among M machines, where each machine needs\nto send only a near-logarithmic number of input lines to the main server for its computations.\nFuture work will consider an input of j-dimensional affine subspaces in Rd (an input of lines is the\nspecial case j = 1), where the motivation is the completion of multiple missing entries.\n\nReferences\n\n[1] Github. https://github.com/YairMarom/k_lines_means, 2019.\n\n[2] Supplementary material. https://arxiv.org/abs/1903.06904, 2019.\n\n[3] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027\u20131035. Society for Industrial and Applied Mathematics, 2007.\n\n[4] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.\n\n[5] Francis R Bach and Michael I Jordan. Learning spectral clustering. 
In Advances in neural information processing systems, pages 305\u2013312, 2004.\n\n[6] Olivier Bachem, Mario Lucic, and Silvio Lattanzi. One-shot coresets: The case of k-clustering. arXiv preprint arXiv:1711.09649, 2017.\n\n[7] Moses Charikar, Sudipto Guha, \u00c9va Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65(1):129\u2013149, 2002.\n\n[8] Ludwig Danzer. \"Helly\u2019s theorem and its relatives,\" in Convexity. In Proc. Symp. Pure Math., volume 7, pages 101\u2013180. Amer. Math. Soc., 1963.\n\n[9] Amazon EC2. Amazon ec2. https://aws.amazon.com/ec2/, 2010.\n\n[10] Dan Feldman and Leonard J Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1343\u20131354. Society for Industrial and Applied Mathematics, 2012.\n\n[11] Jie Gao, Michael Langberg, and Leonard J Schulman. Analysis of incomplete data and an intrinsic-dimension Helly theorem. Discrete & Computational Geometry, 40(4):537\u2013560, 2008.\n\n[12] Jie Gao, Michael Langberg, and Leonard J Schulman. Clustering lines in high-dimensional space: Classification of incomplete data. ACM Transactions on Algorithms (TALG), 7(1):8, 2010.\n\n[13] Mordechai Haklay and Patrick Weber. OpenStreetMap: User-generated street maps. IEEE Pervasive Computing, 7(4):12\u201318, 2008.\n\n[14] Sariel Har-Peled. Coresets for discrete integration and clustering. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 33\u201344. Springer, 2006.\n\n[15] Kfir and Feldman. Coresets for big data learning of Gaussian mixture models of any shape. 2018.\n\n[16] Andreas Krause, Pietro Perona, and Ryan G Gomes. 
Discriminative clustering by regularized information maximization. In Advances in neural information processing systems, pages 775\u2013783, 2010.\n\n[17] Euiwoong Lee and Leonard J Schulman. Clustering affine subspaces: hardness and algorithms. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 810\u2013827. SIAM, 2013.\n\n[18] Yufeng Liu, Rosemary Emery, Deepayan Chakrabarti, Wolfram Burgard, and Sebastian Thrun. Using EM to learn 3D models of indoor environments with mobile robots. In ICML, volume 1, pages 329\u2013336, 2001.\n\n[19] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129\u2013137, 1982.\n\n[20] Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Advances in neural information processing systems, pages 955\u2013962, 2006.\n\n[21] Bj\u00f6rn Ommer and Jitendra Malik. Multi-scale object detection by clustering lines. In Computer Vision, 2009 IEEE 12th International Conference on, pages 484\u2013491. IEEE, 2009.\n\n[22] Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Foundations of Computer Science, 2006. FOCS\u201906. 47th Annual IEEE Symposium on, pages 165\u2013176. IEEE, 2006.\n\n[23] Tomer Perets. Clustering of lines. Open University of Israel, 2011.\n\n[24] Huamin Ren, Hong Pan, S\u00f8ren Ingvor Olsen, and Thomas B Moeslund. How does structured sparsity work in abnormal event detection? In ICML\u201915 Workshop, 2015.\n\n[25] Jie Shen, Ping Li, and Huan Xu. Online low-rank subspace clustering by basis dictionary pursuit. In International Conference on Machine Learning, pages 622\u2013631, 2016.\n\n[26] Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 
High-performance computing on the Intel Xeon Phi. Springer, 5:2, 2014.\n\n[27] David Williams and Lawrence Carin. Analytical kernel matrix completion with incomplete multi-view data. In Proceedings of the International Conference on Machine Learning (ICML) Workshop on Learning with Multiple Views, pages 80\u201386, 2005.\n\n[28] Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In Advances in neural information processing systems, pages 1537\u20131544, 2005.", "award": [], "sourceid": 6975, "authors": [{"given_name": "Yair", "family_name": "Marom", "institution": "University of Haifa"}, {"given_name": "Dan", "family_name": "Feldman", "institution": "University of Haifa"}]}