{"title": "Coresets for Archetypal Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 7247, "page_last": 7255, "abstract": "Archetypal analysis represents instances as linear mixtures of prototypes (the archetypes) that lie on the boundary of the convex hull of the data. Archetypes are thus often better interpretable than factors computed by other matrix factorization techniques. However, the interpretability comes with high computational cost due to additional convexity-preserving constraints. In this paper, we propose efficient coresets for archetypal analysis. Theoretical guarantees are derived by showing that quantization errors of k-means upper bound archetypal analysis; the computation of a provable absolute-coreset can be performed in only two passes over the data. Empirically, we show that the coresets lead to improved performance on several data sets.", "full_text": "Coresets for Archetypal Analysis\n\nSebastian Mair\n\nLeuphana University, Germany\n\nmair@leuphana.de\n\nUlf Brefeld\n\nLeuphana University, Germany\n\nbrefeld@leuphana.de\n\nAbstract\n\nArchetypal analysis represents instances as linear mixtures of prototypes (the\narchetypes) that lie on the boundary of the convex hull of the data. Archetypes are\nthus often better interpretable than factors computed by other matrix factorization\ntechniques. However, the interpretability comes with high computational cost due\nto additional convexity-preserving constraints. In this paper, we propose ef\ufb01cient\ncoresets for archetypal analysis. 
Theoretical guarantees are derived by showing that\nquantization errors of k-means upper bound archetypal analysis; the computation\nof a provable absolute-coreset can be performed in only two passes over the data.\nEmpirically, we show that the coresets lead to improved performance on several\ndata sets.\n\n1\n\nIntroduction\n\nArchetypal analysis (Cutler and Breiman, 1994) is an unsupervised learning method that represents\nevery data point as a convex combination of prototypes, the so-called archetypes. Every data point is\nrepresented as a convex mixture of (a subset of) archetypes and, due to the convexity, these mixtures\nare often interpreted probabilistically.\nA key property of archetypal analysis is that the archetypes are themselves convex mixtures of data\npoints. Consequently, archetypes lie on the boundary of the convex hull of the data. Hence, archetypal\nanalysis approximates the convex hull with a given number of vertices. It follows that this approx-\nimation is equivalent to a matrix factorization of the design matrix. Due to the convexity constraints,\narchetypal-based factorizations are not only better interpretable but unfortunately also much more\nexpensive than regular matrix factorization techniques, which hinders usage at even moderate scales.\nSeveral approaches have been proposed to remedy the edacious nature of archetypal analysis,\nproposing, e.g., ef\ufb01cient active-set quadratic programming (Chen et al., 2014), projected gradients\n(M\u00f8rup and Hansen, 2012), or Frank-Wolfe techniques (Bauckhage et al., 2015) for optimization. Ap-\nproximate solutions compute archetypes on a precomputed subset of the data, e.g., (Mair et al., 2017).\nAlthough these approaches are useful contributions, they do not mitigate the inherent complexity\nof archetypal analysis nor provide theoretical guarantees on the quality of the approximation.\nA theoretically sound alternative is offered by coresets. 
Coresets compactly represent large data sets by weighted subsets on which models perform provably competitively compared to operations on all data. Coresets have successfully been applied to very different methods, including k-means (Lucic et al., 2016; Bachem et al., 2018), support vector machines (Tsang et al., 2005), logistic regression (Munteanu et al., 2018), and Bayesian inference (Huggins et al., 2016; Campbell and Broderick, 2018). The idea is as follows: A small subset of the data is selected (in linear time) according to a strategy such that the subset approximates the original data very well. A learning algorithm will then perform similarly on the original data and the subset, but training on the subset is much more efficient. In this paper, we present coresets for archetypal analysis.\nThe key contributions of this paper are as follows: (i) We show that the objective function of k-means upper bounds the objective of archetypal analysis and show that every coreset for k-means is also a coreset for archetypal analysis, (ii) we propose a simple and efficient sampling strategy to compute a coreset with only two passes over the data, so that (iii) a weak \u03b5-absolute-coreset is guaranteed to be obtained after sampling sufficiently many points, where (iv) the error bound does not depend on the query itself. Finally, (v) we provide empirical results on various data sets to support the theoretical derivation.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Preliminaries\n\n2.1 Archetypal Analysis\nLet X = {x1, . . . , xn} be a data set consisting of n \u2208 N d-dimensional data points, X \u2208 Rn\u00d7d be the design matrix and k \u2208 N be the latent dimensionality. In archetypal analysis (Cutler and Breiman, 1994), every data point xi is represented as a convex combination of k archetypes z1, . . .
, zk, i.e.,\n\nxi = Z^T ai,   sum_{j=1}^{k} (ai)j = 1,   (ai)j \u2265 0,\n\nwhere ai \u2208 Rk is the weight vector of the ith data point and Z \u2208 Rk\u00d7d is the matrix of stacked archetypes. The archetypes zj (j = 1, . . . , k) themselves are also represented as convex combinations of data points, i.e.,\n\nzj = X^T bj,   sum_{i=1}^{n} (bj)i = 1,   (bj)i \u2265 0,\n\nwhere bj \u2208 Rn is the weight vector of the jth archetype.\n\nFigure 1: Archetypal analysis in two dimensions with k = 3 archetypes.\n\nLet A \u2208 Rn\u00d7k and B \u2208 Rk\u00d7n be the matrices consisting of the weights ai (i = 1, . . . , n) and bj (j = 1, . . . , k). Then, archetypal analysis yields a factorization of the design matrix as follows\n\nX \u2248 ABX = AZ,   (1)\n\nwhere Z = BX \u2208 Rk\u00d7d is the matrix of archetypes. Due to the convexity constraints, the weight matrices A and B are row-stochastic. The optimal weight matrices A and B are found by minimizing the residual sum of squares (RSS), given by\n\nRSS(k) = ||X \u2212 ABX||_F^2.   (2)\n\nThe objective function can be rewritten as a sum of projections of the data points onto the archetype-induced convex hull as follows\n\n||X \u2212 AZ||_F^2 = sum_{x \u2208 X} min_{q \u2208 conv({z1, . . . , zk})} ||x \u2212 q||_2^2,\n\nwhere conv(S) refers to the convex hull of a set S. The minimization over points in the convex hull of the data renders the optimization infeasible for realistically sized problems. Although feasible optimization strategies (Chen et al., 2014; Bauckhage et al., 2015; M\u00f8rup and Hansen, 2012) and approximations (Mair et al., 2017; Mei et al., 2018) have been proposed, they all suffer from large dimensionalities and/or sample sizes.\n\n2.2 Coresets\nLet X be a data set of n points in d dimensions. Consider a learning problem with an objective function of the form \u03c6X(Q) = sum_{x \u2208 X} d(x, Q)^2. The goal is to learn the so-called query Q \u2282 Rd with |Q| = k, where d(x, Q)^2 is the minimal squared distance from a data point x to the query Q. For example, in k-means clustering, Q refers to the set of cluster centers and d(x, Q)^2 = min_{q \u2208 Q} ||x \u2212 q||_2^2. The objective function is then given by\n\n\u03c6X(Q) = sum_{x \u2208 X} d(x, Q)^2 = sum_{x \u2208 X} min_{q \u2208 Q} ||x \u2212 q||_2^2.\n\nAlgorithm 1 Lightweight coreset construction for k-means (Bachem et al., 2018)\nInput: Set of data points X, coreset size m\nOutput: Coreset C\n\u00b5 \u2190 mean of X\nfor x \u2208 X do\n   q(x) \u2190 1/2 \u00b7 1/|X| + 1/2 \u00b7 d(x, \u00b5)^2 / sum_{x'} d(x', \u00b5)^2\nend for\nC \u2190 sample m points from X, where each point is sampled with prob. q(x) and has weight 1/(m \u00b7 q(x))\n\nA coreset is a possibly weighted subset C of the full data set X with cardinality m \u226a n, which performs provably competitively with respect to the performance on X. Using non-negative weights wx \u2265 0 on the data points, the objective becomes \u03c6X(Q) = sum_{x \u2208 X} wx \u00b7 d(x, Q)^2. The standard definition of a coreset is as follows.\nDefinition 1. Let \u03b5 > 0 and k \u2208 N. A (weighted) set C is an (\u03b5, k)-coreset of the data X if for any Q \u2282 Rd of cardinality at most k\n\n|\u03c6X(Q) \u2212 \u03c6C(Q)| \u2264 \u03b5\u03c6X(Q).   (3)\n\nThe condition in the definition of a coreset is equivalent to (1 \u2212 \u03b5)\u03c6X(Q) \u2264 \u03c6C(Q) \u2264 (1 + \u03b5)\u03c6X(Q). Hence, the objective of the query learned on the coreset is bounded from below and above by a (1 \u00b1 \u03b5) multiplicative factor of the objective evaluated on the full data set. Note that Definition 1 defines a strong coreset since the bound holds uniformly for all queries Q. 
If the condition in Equation (3) holds only for the optimal query, C is called a weak coreset.\nComputing a coreset for k-means may require k sequential passes over the data (Lucic et al., 2016). Bachem et al. (2018) introduce the notion of lightweight-coresets, which allow for an additional additive term on the right hand side of the bound in Equation (3) and show that the solution can be computed in only two passes over the data. The definition of lightweight-coresets is as follows.\nDefinition 2. (Bachem et al., 2018) Let \u03b5 > 0, k \u2208 N and X \u2282 Rd be a set of points with mean \u00b5 \u2208 Rd. The weighted set C is an (\u03b5, k)-lightweight-coreset of the data X if for any Q \u2282 Rd of cardinality at most k\n\n|\u03c6X(Q) \u2212 \u03c6C(Q)| \u2264 (\u03b5/2) \u03c6X(Q) + (\u03b5/2) \u03c6X({\u00b5}),   (4)\n\nwhere the second summand is the additive term.\nThe lightweight-coreset for k-means is constructed via importance sampling, in order to guide the sampling procedure towards more influential points. The sampling distribution q is a mixture of a uniform distribution and the squared distances to the mean, i.e.,\n\nq(x) = 1/2 \u00b7 1/n + 1/2 \u00b7 d(x, \u00b5)^2 / sum_{i=1}^{n} d(xi, \u00b5)^2.   (5)\n\nThe underlying idea is that points that lie far away from the mean \u00b5 have a larger impact on the objective function and should thus be sampled with higher probability. The procedure is shown in Algorithm 1. 
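For concreteness, the sampling step of Algorithm 1 can be sketched in a few lines of NumPy (an illustrative sketch; function and variable names are ours, not from a reference implementation):

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sketch of Algorithm 1 (Bachem et al., 2018): sample m points with
    probability q(x) = 1/2 * 1/n + 1/2 * d(x, mu)^2 / sum_x' d(x', mu)^2
    and weight each sampled point by 1 / (m * q(x))."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)                      # first pass: mean of the data
    dist2 = ((X - mu) ** 2).sum(axis=1)      # second pass: squared distances
    q = 0.5 / len(X) + 0.5 * dist2 / dist2.sum()
    idx = rng.choice(len(X), size=m, p=q)    # importance sampling with replacement
    return X[idx], 1.0 / (m * q[idx])        # coreset points and their weights

X = np.random.default_rng(1).normal(size=(1000, 5))
C, w = lightweight_coreset(X, m=100)
```

Points far from the mean are sampled more often but receive correspondingly smaller weights, which is what keeps the weighted objective unbiased.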
After sampling m points, each point is weighted by (m \u00b7 q(x))^{\u22121} such that the sampling procedure yields an unbiased estimator of the quantization error:\n\nEC[\u03c6C(Q)] = EC[ sum_{x \u2208 C} 1/(m \u00b7 q(x)) \u00b7 d(x, Q)^2 ] = E[ 1/q(x) \u00b7 d(x, Q)^2 ] = sum_{x \u2208 X} q(x) \u00b7 d(x, Q)^2 / q(x) = \u03c6X(Q).\n\nThe following result ensures that after sampling sufficiently many points, an (\u03b5, k)-lightweight-coreset is obtained.\nTheorem 1 (Bachem et al. (2018)). Let \u03b5 > 0, \u03b4 > 0 and k \u2208 N. Let X be a set of points in Rd and let C be the output of Algorithm 1 with a sample size m of at least\n\nm \u2265 c \u00b7 (dk log k + log(1/\u03b4)) / \u03b5^2,\n\nwhere c is an absolute constant. Then, with probability of at least 1 \u2212 \u03b4, C is an (\u03b5, k)-lightweight-coreset of X.\n\nBachem et al. (2018) argue that dropping (\u03b5/2) \u03c6X(Q) from Equation (4) is not possible for the problem of k-means. Assume, for example, that the cluster centers (query Q) are placed arbitrarily far away from the data. Without the multiplicative term, Equation (4) would show an arbitrarily large difference on the left hand side, but the error on the right hand side would be bounded by (\u03b5/2) \u03c6X({\u00b5(X)}). Hence, C cannot be a coreset because the bound does not hold uniformly for all queries.\nWhile this observation applies to k-means, the situation is very different for archetypal analysis. We assume that the mean \u00b5 of X is actually contained in the convex hull of the query, i.e., \u00b5 \u2208 conv(Q). Hence, placing some points of the query far away from the data induces a larger convex hull and thus a lower projection error. 
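The unbiasedness of the importance-weighted estimator from Section 2.2 can also be verified numerically. The following toy check (our own example; the data and the query Q are arbitrary) compares the Monte Carlo average of the coreset objective against \u03c6X(Q):

```python
import numpy as np

# Monte Carlo check that weighting sampled points by 1/(m*q(x)) yields an
# unbiased estimate of phi_X(Q), using k-means-style distances on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Q = rng.normal(size=(4, 2))  # an arbitrary query of k = 4 centers

# d(x, Q)^2: squared distance of each point to its closest center
d2 = ((X[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2).min(axis=1)
phi_full = d2.sum()  # phi_X(Q) on the full data set

mu = X.mean(axis=0)
dist2 = ((X - mu) ** 2).sum(axis=1)
q = 0.5 / len(X) + 0.5 * dist2 / dist2.sum()  # sampling distribution (5)

m, trials = 50, 2000
estimates = []
for _ in range(trials):
    idx = rng.choice(len(X), size=m, p=q)
    estimates.append((d2[idx] / (m * q[idx])).sum())  # phi_C(Q)
rel_gap = abs(np.mean(estimates) - phi_full) / phi_full  # Monte Carlo noise only
```

Averaged over many draws, the weighted coreset objective matches the full objective up to Monte Carlo noise.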
In the remainder, we also argue that queries of practical interest always lie on the border of the convex hulls of either X or C, and that the mean \u00b5 is always included.\n\n3 Coreset Construction\n\nIn the case of archetypal analysis, the query Q consists of the archetypes z1, . . . , zk. The squared distance of a point x to the query Q is given by the squared distance of the point to the convex set conv(Q), i.e., d(x, Q)^2 = min_{q \u2208 conv(Q)} ||x \u2212 q||_2^2. Hence, the objective function can be rewritten in the following way:\n\n\u03c6X(Q) = sum_{x \u2208 X} d(x, Q)^2 = sum_{x \u2208 X} min_{q \u2208 conv(Q)} ||x \u2212 q||_2^2.\n\nIn the remainder, \u03c6X(Q) refers to the above objective of archetypal analysis.\nBefore introducing and analyzing the coreset construction for archetypal analysis, we show that for a point x the quantization error of k-means upper bounds the projection of x to the query (i.e., the archetypes z1, . . . , zk) in archetypal analysis. As a consequence, every coreset that bounds the error of k-means also bounds the error of archetypal analysis and is thus also a coreset for archetypal analysis.\n\nFigure 2: Illustration of Lemma 1. The projection of a point x to conv(Q) (left side) is smaller than or equal to the distance of x to the closest center q \u2208 Q (right side).\n\nLemma 1. Let x \u2208 Rd be a data point, d(\u00b7, \u00b7) be a distance metric and Q \u2282 Rd be any set of k \u2208 N points, then it holds that\n\nmin_{q \u2208 conv(Q)} d(x, q) \u2264 min_{q \u2208 Q} d(x, q).\n\nProof. First, note that Q \u2282 conv(Q). Assume that q' \u2208 conv(Q) minimizes d(x, q). Then q' is either in conv(Q) \\ Q, resulting in a smaller distance than any other q'' \u2208 Q, or q' is in Q, yielding the same distance as min_{q \u2208 Q} d(x, q).
Hence, the distance of x to the convex set conv(Q) is smaller than or equal to the distance to any q \u2208 Q.\n\nA direct consequence of Lemma 1, which is depicted in Figure 2, is that for any choice of Q, the objective function of archetypal analysis is upper bounded by the objective of k-means, i.e.,\n\nsum_{x \u2208 X} min_{q \u2208 conv(Q)} ||x \u2212 q||_2^2 \u2264 sum_{x \u2208 X} min_{q \u2208 Q} ||x \u2212 q||_2^2.\n\nHere, Q in archetypal analysis refers to the archetypes z1, . . . , zk, whereas in k-means, Q refers to the set of centroids. Since a coreset bounds the error of a method on the entire set, and due to Lemma 1, any coreset for k-means is also a coreset for archetypal analysis.\n\nAlgorithm 2 Coreset construction for Archetypal Analysis\nInput: Set of data points X, coreset size m\nOutput: Coreset C\n\u00b5 \u2190 mean of X\nfor x \u2208 X do\n   q(x) \u2190 d(x, \u00b5)^2 / sum_{x'} d(x', \u00b5)^2\nend for\nC \u2190 sample m points from X, where each point is sampled with prob. q(x) and has weight 1/(m \u00b7 q(x))\n\nProposition 1. Every coreset for k-means is also a coreset for archetypal analysis.\n\nThe proposition implies that the sampling strategy outlined in Algorithm 1 already yields a lightweight-coreset for our problem. However, we show in Section 3.1 that the term (\u03b5/2) \u03c6X(Q) can be dropped to obtain a weak \u03b5-absolute-coreset for archetypal analysis, whose bound does not depend on the query Q.\nDefinition 3. Let \u03b5 > 0 and k \u2208 N. 
A (weighted) set C is an \u03b5-absolute-coreset of the data X if for any Q \u2282 Rd of cardinality at most k\n\n|\u03c6X(Q) \u2212 \u03c6C(Q)| \u2264 \u03b5.   (6)\n\nThe set C is called a weak \u03b5-absolute-coreset if the bound holds only for specific queries Q.\n\nArchetypes are guaranteed to lie on the boundary of the convex hull of the data (Cutler and Breiman, 1994).1 Thus, we are interested in points lying far from the mean \u00b5 of X. Such points increase the convex hull of the archetypes and result in smaller projections and hence in a lower value of the objective function. Following the idea of Bachem et al. (2018), we thus discard the uniform term in Equation (5) and propose the following sampling distribution\n\nq(x) = d(x, \u00b5)^2 / sum_{i=1}^{n} d(xi, \u00b5)^2.\n\n3.1 Analysis\n\nWe now provide a bound on the sample size m to show that Algorithm 2 computes a provably competitive coreset.\nTheorem 2. Let \u03b5 > 0, \u03b4 > 0 and k \u2208 N. Let X be a set of points in Rd with mean \u00b5 \u2208 Rd and let C be the output of Algorithm 2 with a sample size m of at least\n\nm \u2265 c \u00b7 (dk log k + log(1/\u03b4)) / \u03b5^2,\n\nwhere c is an absolute constant. Then, with probability of at least 1 \u2212 \u03b4, the set C fulfills\n\n|\u03c6X(Q) \u2212 \u03c6C(Q)| \u2264 \u03b5\u03c6X({\u00b5})   (7)\n\nfor any query Q \u2282 Rd of cardinality at most k satisfying \u00b5 \u2208 conv(Q).\n\nA proof of Theorem 2 can be found in the supplementary material. Note that the bound on the right hand side of Equation (7) is independent of the query Q and corresponds to the scaled variance of the data. For a normalized data set, Algorithm 2 yields an \u03b5-absolute-coreset, as the following corollary shows.\nCorollary 1. Let \u03b5 > 0, \u03b4 > 0, k \u2208 N, and X be a set of points in Rd with mean \u00b5 \u2208 Rd. Denote by \u00afX the standardized set of points with \u00afxi = (xi \u2212 \u00b5)/\u221a\u03c6X({\u00b5}).
Let C be the output of Algorithm 2 on \u00afX with a sample size m of at least\n\nm \u2265 c \u00b7 (dk log k + log(1/\u03b4)) / \u03b5^2,\n\nwhere c is an absolute constant. Then, with probability of at least 1 \u2212 \u03b4, C is an \u03b5-absolute-coreset of \u00afX, i.e., it holds that\n\n|\u03c6\u00afX(Q) \u2212 \u03c6C(Q)| \u2264 \u03b5\n\nfor any query Q \u2282 Rd of cardinality at most k satisfying \u00b5 \u2208 conv(Q).\n\n1Given that there is more than a single archetype.\n\nFigure 3: Relative error \u03b7 on the full data set as well as computation time in seconds (averages and standard errors of 50 independent runs).\n\nCorollary 1 can be interpreted in the following way: As we decrease \u03b5, the performance gap between archetypal analysis on the full (standardized) data set and archetypal analysis on the coreset closes for a query Q satisfying \u00b5 \u2208 conv(Q). One might ask whether this restriction on the choice of Q is a drawback. The assumption within the various definitions of coresets that the bound has to hold for any choice of Q is very strong. For the problem of k-means this makes sense, as the centers could be anywhere in the space. However, for archetypal analysis, Cutler and Breiman (1994) show that the archetypes z1, . . . , zk lie on the boundary of the data for k > 1. Hence, any meaningful query Q will lie on the boundary of the coreset as well. Such a query will likely contain the mean \u00b5 of X in its convex hull, because C \u2282 X is sampled around \u00b5.\nAs the following theorem shows, the optimal solution Q\u22c6C computed on the coreset C is indeed provably competitive with the solution Q\u22c6X obtained on the full data set.\nTheorem 3. Let \u03b5 > 0 and X be a set of points in Rd with mean \u00b5 \u2208 Rd. Denote by Q\u22c6X the optimal solution on X and by Q\u22c6C the optimal solution on C. 
Then it holds that\n\n\u03c6X(Q\u22c6C) \u2264 \u03c6X(Q\u22c6X) + 2\u03b5\u03c6X({\u00b5}).\n\nA proof of Theorem 3 is provided in the supplementary material.\n\n3.2 Complexity Analysis\nAlgorithm 2 needs one full pass over the data set X of size n in order to determine the mean \u00b5. Then, another pass is needed to compute the distance of each point xi to the mean \u00b5, which defines the sampling distribution q(\u00b7). Hence, the time complexity of Algorithm 2 scales in O(nd). In addition, since q(\u00b7) is a discrete distribution on the n data points, the space complexity is also in O(nd). The same arguments apply to the lightweight-coreset construction of Bachem et al. (2018) as outlined in Algorithm 1.\n\nFigure 4: Relative error \u03b7 on the full data set as well as computation time in seconds (averages and standard errors of 50 independent runs).\n\n4 Experiments\n\nWe now evaluate the coreset construction for archetypal analysis (abs-cs) and compare it to the performance of archetypal analysis on the full data set, a uniform sample (uniform), the lightweight-coresets for k-means (lw-cs, Bachem et al. (2018)), a state-of-the-art coreset construction for k-means (lucic-cs, Lucic et al. (2016)) as well as an approximate solution that learns archetypes on a precomputed subset (frame, Mair et al. (2017)).\nWe initialize the archetypes z1, . . . , zk using the furthest sum procedure (M\u00f8rup and Hansen, 2012). The termination criterion is reached when the relative error between iterations is less than 10^{\u22123}. We measure the error in terms of the residual sum of squares (RSS) in Equation (2) and compute the relative error \u03b7 = (RSScoreset \u2212 RSSfull)/RSSfull with respect to the performance on the full data set. We report on averages over 50 independent repetitions; error bars show standard errors. 
The code is written in Python and all experiments run on an Intel Xeon machine with 28\u00d72.60GHz and 256GB memory.2\nWe evaluate the algorithms on several data sets: Ijcnn1 refers to data from the IJCNN 2001 neural network competition and has n = 49,990 instances in d = 22 dimensions.3 We adopt the preprocessing from Chang and Lin (2001). Pose is a subset of the Human3.6M data set (Ionescu et al., 2011, 2014), deals with 3D human pose estimation, and is part of the ECCV 2018 PoseTrack Challenge.4 It consists of n = 35,832 poses, each of which is represented as 3D coordinates of 16 joints, resulting in a 48-dimensional problem. Song is a subset of the Million Song Dataset (Bertin-Mahieux et al., 2011), which has n = 515,345 data points in d = 90 dimensions, where the task is to predict the year of a song. Covertype (Blackard and Dean, 1999) contains n = 581,012 examples in d = 54 dimensions. The task is to predict the forest cover type from cartographic variables.\nFigure 3 shows the results for k = 100 archetypes. In the top row, the relative error \u03b7 of each approach is evaluated on the full data set and illustrated for subsample sizes ranging from m = 1,000 to m = 8,000, depicted on the x-axis. Unsurprisingly, the relative error decreases with an increasing subsample size for all approaches. The uniform sampling strategy performs almost always worse than its peers. The coreset construction of Lucic et al. 
(2016) (lucic-cs) performs in a few cases on par with our proposed approach (abs-cs), see for example Ijcnn1. In most other cases, the proposed coreset construction yields the best results and outperforms its competitors, especially on the Song data.\n\n2https://github.com/smair/archetypalanalysis-coreset\n3https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n4http://vision.imar.ro/human3.6m/challenge_open.php\n\nTable 1: Relative error in percent and speedup compared to the full data set for k = 25 archetypes (averages and standard errors of 50 independent runs).\n\nData | Method | Rel. error (m = 1000) | Speedup (m = 1000) | Rel. error (m = 5000) | Speedup (m = 5000)\nCovertype | uniform | 181.7% \u00b1 5.2% | 468\u00d7 | 94.6% \u00b1 2.9% | 126\u00d7\nCovertype | lw-cs | 150.8% \u00b1 4.1% | 553\u00d7 | 80.6% \u00b1 2.5% | 119\u00d7\nCovertype | lucic-cs | 162.7% \u00b1 4.8% | 10\u00d7 | 85.4% \u00b1 2.6% | 9\u00d7\nCovertype | abs-cs | 148.9% \u00b1 4.8% | 601\u00d7 | 79.1% \u00b1 2.9% | 111\u00d7\nSong | uniform | 54.4% \u00b1 0.6% | 430\u00d7 | 31.7% \u00b1 0.4% | 78\u00d7\nSong | lw-cs | 35.8% \u00b1 0.7% | 480\u00d7 | 17.7% \u00b1 0.4% | 68\u00d7\nSong | lucic-cs | 35.6% \u00b1 0.6% | 2\u00d7 | 17.5% \u00b1 0.5% | 2\u00d7\nSong | abs-cs | 32.1% \u00b1 0.6% | 486\u00d7 | 12.4% \u00b1 0.3% | 39\u00d7\nPose | uniform | 20.9% \u00b1 0.8% | 9\u00d7 | 5.6% \u00b1 0.4% | 3\u00d7\nPose | lw-cs | 14.2% \u00b1 0.5% | 10\u00d7 | 5.6% \u00b1 0.5% | 3\u00d7\nPose | lucic-cs | 14.4% \u00b1 0.5% | 5\u00d7 | 6.0% \u00b1 0.6% | 3\u00d7\nPose | abs-cs | 15.7% \u00b1 0.6% | 14\u00d7 | 5.5% \u00b1 0.5% | 4\u00d7\nIjcnn1 | uniform | 7.9% \u00b1 0.5% | 17\u00d7 | 4.5% \u00b1 0.5% | 5\u00d7\nIjcnn1 | lw-cs | 8.9% \u00b1 0.7% | 21\u00d7 | 3.9% \u00b1 0.5% | 5\u00d7\nIjcnn1 | lucic-cs | 9.4% \u00b1 0.8% | 5\u00d7 | 5.1% \u00b1 0.6% | 3\u00d7\nIjcnn1 | abs-cs | 8.5% \u00b1 0.6% | 21\u00d7 | 4.0% \u00b1 0.6% | 6\u00d7\n\nThe bottom row in Figure 3 also depicts the relative error \u03b7, however, with respect to the average runtime of a single run. Theoretically, the lightweight-coreset (lw-cs) as well as the proposed coreset construction realize complexities in O(nd).
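The two passes behind this O(nd) bound can be sketched as follows (an illustrative NumPy sketch of Algorithm 2; names are ours, not from the reference implementation):

```python
import numpy as np

def absolute_coreset(X, m, seed=0):
    """Sketch of Algorithm 2 (abs-cs): two passes over X, O(nd) time.
    Sampling is proportional to the squared distance to the mean only,
    i.e., the uniform term of the lightweight-coreset distribution is dropped."""
    mu = X.mean(axis=0)                      # pass 1: mean, O(nd)
    dist2 = ((X - mu) ** 2).sum(axis=1)      # pass 2: squared distances, O(nd)
    q = dist2 / dist2.sum()                  # sampling distribution of Section 3
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, p=q)    # sample with replacement
    return X[idx], 1.0 / (m * q[idx])        # coreset points and weights

X = np.random.default_rng(2).normal(size=(10_000, 3))
C, w = absolute_coreset(X, m=500)
```

Dropping the uniform term concentrates the sample on points far from the mean, which are the candidates for archetypes.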
In practice, however, the proposed approach yields consistently lower relative errors in shorter time. We credit this finding to a better selection of coreset points, resulting in a faster convergence of archetypal analysis. The method of Lucic et al. (2016) (lucic-cs) is consistently the slowest, as it requires k passes over the data for constructing the coreset.\nFigure 4 shows the same evaluation scenario as Figure 3, but for only k = 25 archetypes. Once again, our proposed coreset construction either outperforms its peers or performs on par with the lightweight-coreset construction of Bachem et al. (2018) while being more efficient. Table 1 summarizes the achieved speed-ups. On Covertype, for example, the computation of 25 archetypes with abs-cs and m = 1,000 is 601 times faster than computing the archetypes on the full data set. Although the error is 148.9% higher than the error using archetypes learned on the full data set, all other competitors are consistently outperformed. Increasing the size of the coreset to m = 5,000 yields a much lower relative error of 79.1% while still being 111 times faster to compute. Similar results with smaller speed-ups but also much lower relative errors are obtained for the other data sets.\nThe remaining competitor frame (Mair et al., 2017) precomputes all data points lying on the boundary of the convex hull of the data set (the frame). We were not able to compute the frame within a reasonable amount of time for Covertype and Song.5 On Pose, every data point lies on the boundary, hence the performance is identical to the performance on all data. For Ijcnn1, the number of points on the frame is about 0.57n and the relative error \u03b7 is about 0.03 for k = 100 archetypes. While this error is much lower, the subset size is also much larger. In addition, m is not chosen but implicitly given by the data set.\n\n5 Conclusion\n\nWe introduced coresets for archetypal analysis. 
The derivation was grounded on the observation that the quantization error of k-means serves as an upper bound on the projection error of archetypal analysis; hence, every coreset for k-means is also a coreset for archetypal analysis. We devised an algorithm based on importance sampling that computes a coreset in linear time with only two passes over the data. A theoretical analysis showed that the proposed coreset performs competitively for a sufficiently large sample size. The theoretical results are supported by our empirical findings: the proposed algorithm outperformed its competitors on various data sets in terms of relative error and computation time. For some setups, we observed improved run-times. In sum, our contribution renders archetypal analysis feasible for state-of-the-art-sized data sets.\n\n5Computations take about 2,000 and 4,000 hours for Covertype and Song, respectively.\n\nReferences\nBachem, O., Lucic, M., and Krause, A. (2018). Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1119\u20131127. ACM.\n\nBauckhage, C., Kersting, H., Thurau, C., et al. (2015). Archetypal analysis as an autoencoder. In Workshop New Challenges in Neural Computation.\n\nBertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011).\n\nBlackard, J. A. and Dean, D. J. (1999). Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131\u2013151.\n\nCampbell, T. and Broderick, T. (2018). Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, pages 697\u2013705.\n\nIonescu, C., Li, F., and Sminchisescu, C. (2011). 
Latent structured models for human pose estimation. In\n\nInternational Conference on Computer Vision.\n\nChang, C.-c. and Lin, C.-J. (2001). Ijcnn 2001 challenge: Generalization ability and text decod-\ning. In IJCNN\u201901. International Joint Conference on Neural Networks. Proceedings (Cat. No.\n01CH37222), volume 2, pages 1031\u20131036. IEEE.\n\nChen, Y., Mairal, J., and Harchaoui, Z. (2014). Fast and robust archetypal analysis for representation\nlearning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 1478\u20131485.\n\nCutler, A. and Breiman, L. (1994). Archetypal analysis. Technometrics, 36(4):338\u2013347.\n\nHuggins, J., Campbell, T., and Broderick, T. (2016). Coresets for scalable bayesian logistic regression.\n\nIn Advances in Neural Information Processing Systems, pages 4080\u20134088.\n\nIonescu, C., Papava, D., Olaru, V., and Sminchisescu, C. (2014). Human3.6m: Large scale datasets\nand predictive methods for 3d human sensing in natural environments. IEEE Transactions on\nPattern Analysis and Machine Intelligence.\n\nLucic, M., Bachem, O., and Krause, A. (2016). Strong coresets for hard and soft bregman clustering\n\nwith applications to exponential family mixtures. In Arti\ufb01cial Intelligence and Statistics.\n\nMair, S., Boubekki, A., and Brefeld, U. (2017). Frame-based data factorizations. In International\n\nConference on Machine Learning, pages 2305\u20132313.\n\nMei, J., Wang, C., and Zeng, W. (2018). Online dictionary learning for approximate archetypal\nanalysis. In Proceedings of the European Conference on Computer Vision (ECCV), pages 486\u2013501.\n\nM\u00f8rup, M. and Hansen, L. K. (2012). Archetypal analysis for machine learning and data mining.\n\nNeurocomputing, 80:54\u201363.\n\nMunteanu, A., Schwiegelshohn, C., Sohler, C., and Woodruff, D. (2018). On coresets for logistic\n\nregression. In Advances in Neural Information Processing Systems, pages 6561\u20136570.\n\nTsang, I. W., Kwok, J. 
T., and Cheung, P.-M. (2005). Core vector machines: Fast svm training on\n\nvery large data sets. Journal of Machine Learning Research, 6(Apr):363\u2013392.\n\n9\n\n\f", "award": [], "sourceid": 3951, "authors": [{"given_name": "Sebastian", "family_name": "Mair", "institution": "Leuphana University"}, {"given_name": "Ulf", "family_name": "Brefeld", "institution": "Leuphana"}]}