{"title": "Fast Determinantal Point Process Sampling with Application to Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 2319, "page_last": 2327, "abstract": "Determinantal Point Process (DPP) has gained much popularity for modeling sets of diverse items. The gist of DPP is that the probability of choosing a particular set of items is proportional to the determinant of a positive definite matrix that defines the similarity of those items. However, computing the determinant requires time cubic in the number of items, and is hence impractical for large sets. In this paper, we address this problem by constructing a rapidly mixing Markov chain, from which we can acquire a sample from the given DPP in sub-cubic time. In addition, we show that this framework can be extended to sampling from cardinality-constrained DPPs. As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.", "full_text": "Fast Determinantal Point Process Sampling with\n\nApplication to Clustering\n\nSamsung Advanced Institute of Technology\n\nByungkon Kang \u2217\n\nYongin, Korea\n\nbk05.kang@samsung.com\n\nAbstract\n\nDeterminantal Point Process (DPP) has gained much popularity for modeling sets\nof diverse items. The gist of DPP is that the probability of choosing a particular\nset of items is proportional to the determinant of a positive de\ufb01nite matrix that de-\n\ufb01nes the similarity of those items. However, computing the determinant requires\ntime cubic in the number of items, and is hence impractical for large sets. In this\npaper, we address this problem by constructing a rapidly mixing Markov chain,\nfrom which we can acquire a sample from the given DPP in sub-cubic time. In ad-\ndition, we show that this framework can be extended to sampling from cardinality-\nconstrained DPPs. 
As an application, we show how our sampling algorithm can be used to provide a fast heuristic for determining the number of clusters, resulting in better clustering.\n\nThere are some crucial errors in the proofs of the theorems, which invalidate the theoretical claims of this paper. Please consult the appendix for more details.\n\n1 Introduction\n\nDeterminantal Point Process (DPP) [1] is a well-known framework for representing a probability distribution that models diversity. Originally proposed to model repulsion among physical particles, it has found its way into many applications in AI, such as image search [2] and text summarization [3].\n\nIn a nutshell, given an itemset S = [n] = {1, 2, \u00b7\u00b7\u00b7, n} and a symmetric positive definite (SPD) matrix L \u2208 R^{n \u00d7 n} that describes pairwise similarities, a (discrete) DPP is a probability distribution over 2^S proportional to the determinant of the corresponding submatrix of L. This distribution is known to assign more probability mass to sets of points that have larger diversity, which is quantified by the entries of L.\n\nAlthough the size of the support is exponential, DPP offers tractable inference and sampling algorithms. However, sampling from a DPP requires O(n^3) time, as an eigen-decomposition of L is necessary [4]. This presents a huge computational problem when there are a large number of items, e.g., n > 10^4. A motivating problem we consider is that of kernelized clustering [5]. In this problem, we are given a large number of points plus a kernel function that serves as a dot product between the points in a feature space. The objective is to partition the points into some number of clusters, each represented by a point called the centroid, in a way that a certain cost function is minimized. Our approach is to sample the centroids via DPP. 
This heuristic is based on the fact that the clusters should differ from one another as much as possible, which is precisely what DPPs prefer. Naively using the cubic-complexity sampling algorithm is inefficient, since it can take up to a whole day to eigen-decompose a 10000 \u00d7 10000 matrix.\n\nIn this paper, we present a rapidly mixing Markov chain whose stationary distribution is the DPP of interest. Our Markov chain does not require the eigen-decomposition of L, and is hence time-efficient. Moreover, our algorithm works seamlessly even when new items are added to S (and L), while the previous sampling algorithm requires expensive decompositions whenever S is updated.\n\n\u2217This work was submitted when the author was a graduate student at KAIST.\n\n1.1 Settings\n\nMore specifically, a DPP over the set S = [n], given a positive-definite similarity matrix L \u227b 0, is a probability distribution P_L over any Y \u2286 S in the following form:\n\nP_L(Y = Y) = det(L_Y) / \u2211_{Y' \u2286 S} det(L_{Y'}) = det(L_Y) / det(L + I),\n\nwhere I is the identity matrix of the corresponding dimension, Y is a random subset of S, and L_Y \u227b 0 is the principal minor of L whose rows and columns are restricted to the elements of Y; i.e., L_Y = [L(i, j)]_{i,j \u2208 Y}, where L(i, j) is the (i, j) entry of L. Much of the literature introduces DPPs in terms of a marginal kernel that describes marginal probabilities of inclusion. However, since directly modeling probabilities over each subset of S^1 offers a more flexible framework, we will focus on the latter representation.\n\nThere is a variant of DPPs that places a constraint on the size of the random subsets. 
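As an aside not in the paper, the definition of P_L above can be checked numerically on a toy matrix: the normalizer over all 2^n subsets equals det(L + I). The helper name `dpp_prob` is ours, purely for illustration.

```python
import numpy as np
from itertools import combinations

def dpp_prob(L, Y):
    """P_L(Y) = det(L_Y) / det(L + I) for an L-ensemble; det of the empty matrix is 1."""
    n = L.shape[0]
    Y = list(Y)
    num = np.linalg.det(L[np.ix_(Y, Y)]) if Y else 1.0
    return num / np.linalg.det(L + np.eye(n))

# a tiny random SPD similarity matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
L = A @ A.T + 0.1 * np.eye(4)

# probabilities over all 2^4 subsets sum to 1, confirming the det(L + I) normalizer
total = sum(dpp_prob(L, Y) for r in range(5) for Y in combinations(range(4), r))
```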
Given an integer k, a k-DPP is a DPP over size-k sets [2]:\n\nP^k_L(Y = Y) = det(L_Y) / \u2211_{|Y'| = k} det(L_{Y'}) if |Y| = k, and 0 otherwise.\n\nIn the discussion, we will use a characteristic vector representation of each Y \u2286 S; i.e., v_Y \u2208 {0, 1}^{|S|} for all Y \u2286 S, such that v_Y(i) = 1 if i \u2208 Y, and 0 otherwise. With abuse of notation, we will often use set operations on characteristic vectors to indicate the same operation on the corresponding sets; e.g., v_Y \\ {u} is equivalent to setting v_Y(u) = 0 and corresponds to Y \\ {u}.\n\n2 Algorithm\n\nThe overall idea of our algorithm is to design a rapidly-mixing Markov chain whose stationary distribution is P_L. The state space of our chain consists of the characteristic vectors of each subset of S. This Markov chain is generated by a standard Metropolis-Hastings algorithm, where the transition probability from state v_Y to v_Z is given by the ratio of P_L(Z) to P_L(Y). In particular, we will only consider transitions between adjacent states - states that have Hamming distance 1. Hence, the transition probability of removing an element u from Y is of the following form:\n\nPr(Y \u222a {u} \u2192 Y) = min{1, det(L_Y) / det(L_{Y \u222a {u}})}.\n\nThe addition probability is defined similarly. The overall chain is an insertion/deletion chain, where a uniformly proposed element is either added to, or removed from, the current state. This procedure is outlined in Algorithm 1. Note that this algorithm has a potentially high computational complexity, as the determinant of L_Y for a given Y \u2286 S must be computed on every iteration. If the size of Y is large, then a single iteration will become very costly. 
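To make the chain concrete, the following standalone numpy sketch (ours, not the paper's code) runs the insertion/deletion chain with naive determinant evaluations; `dpp_mh_step` is a hypothetical helper name, and the later incremental-inverse speedup is deliberately omitted here.

```python
import numpy as np

def dpp_mh_step(L, Y, rng):
    """One insertion/deletion Metropolis-Hastings step of the chain (naive determinants)."""
    n = L.shape[0]
    u = int(rng.integers(n))  # uniformly proposed element

    def det_sub(S):
        S = sorted(S)
        return np.linalg.det(L[np.ix_(S, S)]) if S else 1.0

    if u not in Y:
        # accept insertion with prob. min{1, det(L_{Y u {u}}) / det(L_Y)}
        if rng.random() < min(1.0, det_sub(Y | {u}) / det_sub(Y)):
            Y = Y | {u}
    else:
        # accept deletion with prob. min{1, det(L_{Y \ {u}}) / det(L_Y)}
        if rng.random() < min(1.0, det_sub(Y - {u}) / det_sub(Y)):
            Y = Y - {u}
    return Y

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
L = A @ A.T + 0.5 * np.eye(6)  # SPD similarity matrix
Y = set()
for _ in range(200):
    Y = dpp_mh_step(L, Y, rng)
```

Each step recomputes two determinants, which is exactly the O(|Y|^3) per-iteration cost the paper then removes via Schur-complement updates.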
Before discussing how to address this issue in Section 2.1, we analyze the properties of Algorithm 1 to show that it efficiently samples from P_L. First, we state that the chain induced by Algorithm 1 does indeed represent our desired distribution.^2\n\nProposition 1. The Markov chain in Algorithm 1 has a stationary distribution P_L.\n\nThe computational complexity of sampling from P_L using Algorithm 1 depends on the mixing time of the Markov chain; i.e., the number of steps required in the Markov chain to ensure that the current distribution is \u201cclose enough\u201d to the stationary distribution. More specifically, we are interested in the \u03b5-mixing time \u03c4(\u03b5), which guarantees a distribution that is \u03b5-close to P_L in terms of total variation. In other words, we must spend at least this many time steps in order to acquire a sample from a distribution close to P_L. Our next result states that the chain in Algorithm 1 mixes rapidly:\n\n^1 Also known as L-ensembles.\n^2 All proofs, including those of irreducibility of our chains, are given in the Appendix of the full version of our paper.\n\nAlgorithm 1 Markov chain for sampling from P_L\nRequire: Itemset S = [n], similarity matrix L \u227b 0\n  Randomly initialize state Y \u2286 S\n  while Not mixed do\n    Sample u \u2208 S uniformly at random\n    Set\n      p^+_u(Y) \u2190 min{1, det(L_{Y \u222a {u}}) / det(L_Y)}\n      p^-_u(Y) \u2190 min{1, det(L_{Y \\ {u}}) / det(L_Y)}\n    if u \u2209 Y then\n      Y \u2190 Y \u222a {u} with prob. p^+_u(Y)\n    else\n      Y \u2190 Y \\ {u} with prob. p^-_u(Y)\n    end if\n  end while\n  return Y\n\nTheorem 1. The Markov chain in Algorithm 1 has a mixing time \u03c4(\u03b5) = O(n log(n/\u03b5)).\n\nOne advantage of having a rapidly-mixing Markov chain as a means of sampling from a DPP is that it is robust to addition/deletion of elements. 
That is, when a new element is introduced to or removed from S, we may simply continue the current chain until it is mixed again to obtain a sample from the new distribution. The previous sampling algorithm, on the other hand, requires an expensive eigen-decomposition of the updated L.\n\nThe mixing time of the chain in Algorithm 1 seems to offer a promising direction for sampling from P_L. However, note that this is subject to the presence of an efficient procedure for computing det(L_Y). Unfortunately, computing the determinant already costs O(|Y|^3) operations, rendering Algorithm 1 impractical for large Y's. In the following sections, we present a linear-algebraic manipulation of the determinant ratio so that explicit computation of the determinants is unnecessary.\n\n2.1 Determinant Ratio Computation\n\nIt turns out that we do not need to explicitly compute the determinants, but rather the ratio of determinants. Suppose we wish to compute det(L_{Y \u222a {u}}) / det(L_Y). Since the determinant is permutation-invariant with respect to the index set, we can represent L_{Y \u222a {u}} in the following block matrix form, due to its symmetry:\n\nL_{Y \u222a {u}} = [ L_Y, b_u ; b_u^T, c_u ],\n\nwhere b_u = (L(i, u))_{i \u2208 Y} \u2208 R^{|Y|} and c_u = L(u, u). 
With this, the determinant of L_{Y \u222a {u}} is expressed as\n\ndet(L_{Y \u222a {u}}) = det(L_Y) (c_u - b_u^T L_Y^{-1} b_u).   (1)\n\nThis allows us to re-formulate the insertion transition probability as a determinant-free ratio:\n\np^+_u(Y) = min{1, det(L_{Y \u222a {u}}) / det(L_Y)} = min{1, c_u - b_u^T L_Y^{-1} b_u}.   (2)\n\nThe deletion transition probability p^-_u(Y \u222a {u}) is computed likewise:\n\np^-_u(Y \u222a {u}) = min{1, det(L_Y) / det(L_{Y \u222a {u}})} = min{1, (c_u - b_u^T L_Y^{-1} b_u)^{-1}}.\n\nHowever, this transformation alone does not seem to result in enhanced computation time, as computing the inverse of a matrix is just as time-consuming as computing the determinant.\n\nTo save time on computing L_{Y'}^{-1}, we incrementally update the inverse through blockwise matrix inversion. Suppose that the matrix L_Y^{-1} has already been computed at the current iteration of the chain. First, consider the case when an element u is added (\u2018if\u2019 clause). The new inverse L_{Y \u222a {u}}^{-1} must be updated from the current L_Y^{-1}. This is achieved by the following block-inversion formula [6]:\n\nL_{Y \u222a {u}}^{-1} = [ L_Y, b_u ; b_u^T, c_u ]^{-1} = [ L_Y^{-1} + L_Y^{-1} b_u b_u^T L_Y^{-1} / d_u, -L_Y^{-1} b_u / d_u ; -b_u^T L_Y^{-1} / d_u, 1/d_u ],   (3)\n\nwhere d_u = c_u - b_u^T L_Y^{-1} b_u is the Schur complement of L_Y. Since L_Y^{-1} is already given, computing each block of the new inverse matrix costs O(|Y|^2), which is an order faster than the O(|Y|^3) complexity required by the determinant. Moreover, only half of the entries need be computed, due to symmetry.\n\nNext, consider the case when an element u is removed (\u2018else\u2019 clause) from the current set Y. 
The matrix to be updated is L_{Y \\ {u}}^{-1}, and it is given by a rank-1 update formula. We first represent the current inverse L_Y^{-1} as\n\nL_Y^{-1} = [ L_{Y \\ {u}}, b_u ; b_u^T, c_u ]^{-1} =: [ D, e ; e^T, f ],\n\nwhere D \u2208 R^{(|Y|-1) \u00d7 (|Y|-1)}, e \u2208 R^{|Y|-1}, and f \u2208 R. Then, the inverse of the submatrix L_{Y \\ {u}} is given by\n\nL_{Y \\ {u}}^{-1} = D - e e^T / f.   (4)\n\nAgain, updating L_{Y \\ {u}}^{-1} only requires matrix arithmetic, and hence costs O(|Y|^2).\n\nHowever, the initial L_Y^{-1}, from which all subsequent inverses are updated, must be computed in full at the beginning of the chain. This cost can be reduced by restricting the size of the initial Y. That is, we first randomly initialize Y with a small size, say o(n^{1/3}), and compute the inverse L_Y^{-1}. As we proceed with the chain, we update L_Y^{-1} using Equations 3 and 4 when an insertion or a deletion proposal is accepted, respectively. Therefore, the average complexity of acquiring a sample from a distribution that is \u03b5-close to P_L is O(T^2 n log(n/\u03b5)), where T is the average size of Y encountered during the progress of the chain. In Section 3, we introduce a scheme to maintain a small-sized Y.\n\n2.2 Extension to k-DPPs\n\nThe idea of constructing a Markov chain to obtain a sample can be extended to k-DPPs. The only known algorithm so far for sampling from a k-DPP also requires the eigen-decomposition of L. Extending the previous idea, we provide a Markov chain sampling algorithm similar to Algorithm 1 that samples from P^k_L.\n\nThe main idea behind the k-DPP chain is to propose a new configuration by choosing two elements: one to remove from the current set, and another to add. We accept this move according to the probability defined by the ratio of the proposed determinant to the current determinant. 
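The Schur-complement identities behind Equations (1) and (3) can be sanity-checked numerically; this is an illustrative numpy snippet of ours, not code from the paper, verifying both on a small random SPD matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
L = A @ A.T + 0.2 * np.eye(5)  # SPD similarity matrix

Y = [0, 1, 2]  # current set
u = 4          # element to insert (last index, matching the block layout)
LY = L[np.ix_(Y, Y)]
b = L[np.ix_(Y, [u])]
c = L[u, u]
LY_inv = np.linalg.inv(LY)

# Equation (1): det(L_{Y u {u}}) = det(L_Y) * (c_u - b^T L_Y^{-1} b)
d = (c - b.T @ LY_inv @ b).item()  # Schur complement d_u
lhs = np.linalg.det(L[np.ix_(Y + [u], Y + [u])])
rhs = np.linalg.det(LY) * d

# Equation (3): blockwise inverse of the enlarged matrix from L_Y^{-1}
top_left = LY_inv + (LY_inv @ b @ b.T @ LY_inv) / d
top_right = -LY_inv @ b / d
inv_big = np.block([[top_left, top_right],
                    [top_right.T, np.array([[1.0 / d]])]])
```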
This is equivalent to selecting a row and column of L_X and replacing them with the ones corresponding to the element to be added; i.e., for X = Y \u222a {u},\n\nL_{X = Y \u222a {u}} = [ L_Y, b_u ; b_u^T, c_u ]  \u21d2  L_{X' = Y \u222a {v}} = [ L_Y, b_v ; b_v^T, c_v ],\n\nwhere u and v are the elements being removed and added, respectively. Following Equation 2, we set the transition probability as the ratio of the determinants of the two matrices:\n\ndet(L_{X'}) / det(L_X) = (c_v - b_v^T L_Y^{-1} b_v) / (c_u - b_u^T L_Y^{-1} b_u).\n\nThe final procedure is given in Algorithm 2.\n\nSimilarly to Algorithm 1, we present the analysis of the stationary distribution and the mixing time of Algorithm 2.\n\nProposition 2. The Markov chain in Algorithm 2 has a stationary distribution P^k_L.\n\nAlgorithm 2 Markov chain for sampling from P^k_L\nRequire: Itemset S = [n], similarity matrix L \u227b 0\n  Randomly initialize state X \u2286 S, s.t. |X| = k\n  while Not mixed do\n    Sample u \u2208 X and v \u2208 S \\ X u.a.r.\n    Letting Y = X \\ {u}, set\n      p \u2190 min{1, (c_v - b_v^T L_Y^{-1} b_v) / (c_u - b_u^T L_Y^{-1} b_u)}.   (5)\n    X \u2190 Y \u222a {v} with prob. p\n  end while\n  return X\n\nTheorem 2. The Markov chain in Algorithm 2 has a mixing time \u03c4(\u03b5) = O(k log(k/\u03b5)).\n\nThe main computational bottleneck of Algorithm 2 is the inversion of L_Y. Since L_Y is a (k-1) \u00d7 (k-1) matrix, the per-iteration cost is O(k^3). However, this complexity can be reduced by applying Equation 3 on every iteration to update the inverse. This leads to a final sampling complexity of O(k^3 log(k/\u03b5)) for acquiring a single sample from the chain, which dominates the O(k^3) cost of computing the initial inverse. 
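A minimal sketch of the swap chain of Algorithm 2, using the Schur-complement ratio of Equation (5). This is our illustration, not the paper's code: the helper name `kdpp_swap_step` is hypothetical, and the inverse is recomputed naively each step where the paper would apply Equation 3 incrementally.

```python
import numpy as np

def kdpp_swap_step(L, X, rng):
    """One swap step: propose removing u from X and adding v from outside X."""
    n = L.shape[0]
    X = list(X)
    u = X[rng.integers(len(X))]
    outside = [v for v in range(n) if v not in X]
    v = outside[rng.integers(len(outside))]
    Y = [i for i in X if i != u]
    LY_inv = np.linalg.inv(L[np.ix_(Y, Y)])  # naive O(k^3); Eq. (3) makes this O(k^2)

    def schur(w):
        # c_w - b_w^T L_Y^{-1} b_w, the determinant ratio det(L_{Y u {w}}) / det(L_Y)
        b = L[np.ix_(Y, [w])]
        return (L[w, w] - b.T @ LY_inv @ b).item()

    p = min(1.0, schur(v) / schur(u))
    return set(Y + [v]) if rng.random() < p else set(X)

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 8))
L = A @ A.T + 0.3 * np.eye(8)
X = {0, 1, 2}  # k = 3
for _ in range(100):
    X = kdpp_swap_step(L, X, rng)
```

Note that the swap proposal preserves |X| = k by construction, so the chain never leaves the support of P^k_L.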
In many cases, k will be a constant much smaller than n, so our algorithm is efficient in general.\n\n3 Application to Clustering\n\nFinally, we show how our algorithms lead to an efficient heuristic for the k-means clustering problem when the number of clusters is not known. First, we briefly overview the k-means problem. Given a set of points P = {x_i \u2208 R^d}_{i=1}^n, the goal of clustering is to construct a partition C = {C_1, \u00b7\u00b7\u00b7, C_k | C_i \u2286 P} of P such that the distortion\n\nD_C = \u2211_{i=1}^k \u2211_{x \u2208 C_i} ||x - m_i||^2   (6)\n\nis minimized, where m_i is the centroid of cluster C_i. It is known that the optimal centroid is the mean of the points of C_i; i.e., m_i = (\u2211_{x \u2208 C_i} x) / |C_i|. Iteratively minimizing this expression converges to a local optimum, and is hence the preferred approach in many works. However, determining the number of clusters k is the factor that makes this problem NP-hard [7]. Note that the problem of unknown k prevails in other types of clustering algorithms, such as kernel k-means [5] and spectral clustering [8]: kernel k-means is exactly the same as regular k-means except that the inner products are substituted with a positive semi-definite kernel function, and spectral clustering uses regular k-means clustering as a subroutine. Some common techniques for determining k include performing a density-based analysis of the data [9], or selecting the k that minimizes the Bayesian information criterion (BIC) [10].\n\nIn this work, we propose to sample the initial centroids of the clustering via our DPP sampling algorithms. The similarity matrix L will be the Gram matrix determined by L(i, j) = \u03ba(x_i, x_j), where \u03ba(\u00b7, \u00b7) is simply the inner product for regular k-means, and a specified kernel function for kernel k-means. Since DPPs naturally capture the notion of diversity, the sampled points will tend to be more diverse, and thus serve better as initial representatives for each cluster. 
Once we have a sample, we perform a Voronoi partition around the elements of the sample to obtain a clustering.^3 Note that it is not necessary to determine k beforehand, as it can be obtained from the DPP samples. This approach is closely related to the MAP inference problem for DPPs [11], which is known to be NP-hard as well. We use the proposed algorithms to sample representative points that have high probability under P_L, and cluster the rest of the points around the sample. Subsequently, standard (kernel) k-means algorithms can be applied to improve this initial clustering.\n\n^3 The distance between x and y is defined as sqrt(\u03ba(x, x) - 2\u03ba(x, y) + \u03ba(y, y)), for any positive semi-definite kernel \u03ba.\n\nSince DPPs model both size and diversity, it seems that we could simply collect samples from Algorithm 1 directly, and use those samples as representatives. However, as pointed out by [2], modeling both properties simultaneously can negatively bias the quality of diversity. To reduce this possible negative influence, we adopt a two-step sampling strategy: first, we gather C samples from Algorithm 1 and construct a histogram H over the sizes of the samples. Next, we sample from a k-DPP, by Algorithm 2, with k acquired from H. This last sample forms the representatives we use to cluster.\n\nAnother problem we may encounter in this scheme is sensitivity to outliers. The presence of an outlier in P can cause the DPP in the first phase to favor the inclusion of that outlier, resulting in a possibly bad clustering. To make our approach more robust to outliers, we introduce the following cardinality-penalized DPP:\n\nP_{L;\u03bb}(Y = Y) \u221d exp(tr(log(L_Y)) - \u03bb|Y|) = det(L_Y) / exp(\u03bb|Y|),\n\nwhere \u03bb \u2265 0 is a hyper-parameter that controls the weight placed on |Y|. This regularization scheme smoothes the original P_L by exponentially discounting the size of Y's. 
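Assuming the insertion/deletion chain of Algorithm 1, the cardinality penalty only rescales each acceptance ratio by a constant factor. A small sketch of ours (the function name is hypothetical) makes this explicit:

```python
import math

def penalized_acceptance(d_u, inserting, lam):
    """Metropolis acceptance under P_{L;lam}(Y) proportional to det(L_Y) / exp(lam |Y|).

    The plain DPP determinant ratio (the Schur complement d_u for an insertion,
    1/d_u for a deletion, per Equation 2) is multiplied by exp(-lam) when |Y|
    grows by one and by exp(+lam) when it shrinks by one.
    """
    ratio = d_u if inserting else 1.0 / d_u
    return min(1.0, ratio * math.exp(-lam if inserting else lam))
```

With lam = 0 this reduces to the unpenalized acceptance, and larger lam makes insertions uniformly harder to accept.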
This does not increase the order of the mixing time of the induced chain, since only a constant factor of exp(\u00b1\u03bb) is multiplied into the transition probabilities. Algorithm 3 describes the overall procedure of our DPP-based clustering.\n\nAlgorithm 3 DPP-based Clustering\nRequire: L \u227b 0, \u03bb \u2265 0, R > 0, C > 0\n  Gather {S_1, \u00b7\u00b7\u00b7, S_C | S_i \u223c P_{L;\u03bb}} (Algorithm 1)\n  Construct histogram H = {|S_i|}_{i=1}^C on the sizes of the S_i's\n  for j = 1, \u00b7\u00b7\u00b7, R do\n    Sample M_j \u223c P^{k_j}_L (Algorithm 2), where k_j \u223c H\n    Voronoi partition around M_j\n  end for\n  return clustering with lowest distortion (Equation 6)\n\nChoosing the right value of \u03bb usually requires a priori knowledge of the data set. Since this information is not always available, one may use a small subset of P to heuristically choose \u03bb. For example, examine the BIC of the initial clustering with respect to the centroids sampled from O(\u221an) randomly chosen elements P' \u2282 P, with \u03bb = 0. Then, increase \u03bb by 1 until we encounter the point where the BIC hits a local maximum, and choose that as the final value. An additional binary search step may be used between \u03bb and \u03bb + 1 to further fine-tune the value. Because we only use O(\u221an) points to sample from the DPP, each search step has at most linear complexity, allowing for ample time to choose better \u03bb's. This procedure may not appear to have an apparent advantage over standard BIC-based model selection for choosing the number of clusters k. However, tuning \u03bb not only allows one to determine k, but also gives better initial partitions in terms of distortion.\n\n4 Experiments\n\nIn this section, we empirically demonstrate how our proposed method of choosing an initial clustering, denoted DPP-MC, compares to other methods in terms of distortion and running time. 
The methods we compare against include:\n\n\u2022 DPP-Full: Sample using the full DPP sampling procedure as given in [4].\n\u2022 DPP-MAP: Sample the initial centroids according to the MAP configuration, using the algorithm of [11].\n\u2022 KKM: Plain kernel k-means clustering given by [5], run on the \u201ctrue\u201d number of clusters.\n\nDPP-Full and DPP-MAP were used only in the first phase of Algorithm 3. To summarize the testing procedure: DPP-MC, DPP-Full, and DPP-MAP were used to choose the initial centroids. After this initialization, KKM was carried out to improve the initial partitioning. Hence, the only difference between the algorithms tested and KKM is the initialization.\n\nThe real-world data sets we use are the letter recognition data set [12] (LET) and a subset of the power consumption data set [13] (PWC). The LET set is represented as 10,000 points in R^16, and the PWC set as 10,000 points in R^7. While the LET set has 26 ground-truth clusters, the PWC set is only labeled with timestamps. Hence, we manually divided all points into four clusters, based on the month of the timestamps. Since this partitioning is not a ground truth given by the data collector, we expected the KKM algorithm to perform badly on this set.\n\nIn addition, we also tested our algorithm on an artificially-generated set consisting of 15,000 points in R^10 drawn from five mixtures of Gaussians (MG). However, this task is made challenging by roughly merging the five Gaussians, so that it is more likely to discover fewer clusters. The purpose of this set is to examine how well our algorithm finds the appropriate number of clusters. For the MG set, we present the result of DBSCAN [9], another clustering algorithm that does not require k beforehand.\n\nWe used a simple polynomial kernel of the form \u03ba(x, y) = (x \u00b7 y + 0.05)^3 for the real-world data sets, and a dot product for the artificial set. 
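For illustration (our sketch, not the paper's code), the polynomial-kernel Gram matrix and the kernel-space Voronoi assignment of footnote 3 can be written as follows; the function names are ours, and the two-blob data is synthetic.

```python
import numpy as np

def poly_kernel_gram(X):
    """Gram matrix for the polynomial kernel k(x, y) = (x . y + 0.05)^3."""
    return (X @ X.T + 0.05) ** 3

def kernel_voronoi(K, centers):
    """Assign each point to the nearest center under the kernel distance
    d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y)."""
    centers = list(centers)
    diag = np.diag(K)
    # squared kernel distances, shape (n_points, n_centers)
    d2 = diag[:, None] - 2.0 * K[:, centers] + diag[centers][None, :]
    return np.argmin(d2, axis=1)

# two well-separated blobs; one "sampled centroid" taken from each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
labels = kernel_voronoi(poly_kernel_gram(X), [0, 25])
```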
Algorithm 3 was run with \u03c4_1 = n log(n/0.01) and \u03c4_2 = k log(k/0.01) mixing steps for the first and second phases, respectively, and C = R = 10. The running time of our algorithm includes the time taken to heuristically search for \u03bb using the following BIC [14]:\n\nBIC_k := \u2211_{x \u2208 P} log Pr(x | {m_i}_{i=1}^k, \u03c3) - (kd/2) log n,\n\nwhere \u03c3 is the average of each cluster's distortion, and d is the dimension of the data set. The tuning procedure is the same as the one given at the end of the previous section, without using binary search.\n\n4.1 Real-World Data Sets\n\nThe plots of the distortion and time for the LET set over the clustering iterations are given in Figure 1. Recall that KKM was run with the true number of clusters as its input, so one may expect it to perform relatively better, in terms of distortion and running time, than the other algorithms, which must compute k. The plots show that this is indeed the case, with our DPP-MC outperforming its competitors. Both DPP-Full and DPP-MAP require long running times for the eigen-decomposition of the similarity matrix. It is interesting to note that DPP-MAP does not perform better than plain DPP-Full. We conjecture that this phenomenon is due to the approximate nature of the MAP inference.\n\nFigure 1: Distortion (left) and cumulative runtime (right) of the clustering induced by the competing algorithms on the LET set.\n\nIn Table 1, we give a summary of the DPP-based initialization procedures. 
The reported values are the immediate results of the initialization. For DPP-MC, the running time includes the automated \u03bb tuning. Taking this fact into account, DPP-MC was able to recover the true value of k quickly.\n\nIn Figure 2, we show the same results on the PWC set. As in the previous case, DPP-MC exhibits the lowest distortion with the fastest running time. For this set, we have omitted the results for DPP-MAP, as it yielded a degenerate result of a single cluster. Nevertheless, we give its final result in Table 1.\n\nTable 1: Comparison among the DPP-based initializations for the LET set (left) and the PWC set (right).\n\n            |        LET set            |        PWC set\n            | DPP-MC  DPP-Full  DPP-MAP | DPP-MC  DPP-Full  DPP-MAP\nDistortion  | 36020   42841     43719   | 9.78    20.15     150\nTime (sec.) | 20      820       2850    | 15      50        220\nk           | 26      6         16      | 13      6         1\n\u03bb           | 2       -         -       | 4       -         -\n\nFigure 2: Distortion (left) and cumulative time (right) of the clustering induced by the competing algorithms on the PWC set.\n\n4.2 Artificial Data Set\n\nFinally, we present results on clustering the artificial MG set. On this set, we compare our algorithm with DBSCAN, another clustering algorithm that does not require k a priori. Due to page constraints, we summarize the result in Table 2.\n\nTable 2: Comparison between DPP-MC and DBSCAN on the MG set.\n\n            | DPP-MC  DBSCAN\nDistortion  | 6.127   35.967\nTime (sec.) | 416     60\nk           | 34      1\n\nDue to the merged configuration of the MG set, DBSCAN is not able to successfully discover multiple clusters, and ends up with a singleton clustering. 
On the other hand, DPP-MC managed to find many distinct clusters in a way that lowers the distortion.\n\n5 Discussion and Future Work\n\nWe have proposed a fast method for sampling from an \u03b5-close DPP distribution and its application to kernelized clustering. Although the exact computational complexity of sampling (O(T^2 n log(n/\u03b5))) is not explicitly superior to the previous approach (O(n^3)), we empirically show that T is generally small enough to account for our algorithm's efficiency. Furthermore, the extension to k-DPP sampling yields a very large speed-up compared to the previous sampling algorithm.\n\nHowever, one must keep in mind that the mixing time analysis is for a single sample only; i.e., we must mix the chain for each sample we need. For a small number of samples, this may compensate for the cubic complexity of the previous approach. For a larger number of samples, we must further investigate the effect of sample correlation after mixing in order to prove long-term efficiency.\n\nReferences\n\n[1] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. ArXiv, 2012.\n\n[2] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of ICML, 2011.\n\n[3] A. Kulesza and B. Taskar. Learning determinantal point processes. In Proceedings of UAI, 2011.\n\n[4] J. B. Hough, M. Krishnapur, Y. Peres, and B. Vir\u00e1g. Determinantal processes and independence. Probability Surveys, 3, 2006.\n\n[5] I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In Proceedings of ACM SIGKDD, 2004.\n\n[6] G. Golub and C. van Loan. Matrix Computations. Johns Hopkins University Press, 1996.\n\n[7] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75:245-248, 2009.\n\n[8] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. 
In\n\nProceedings of NIPS, 2001.\n\n[9] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters\n\nin large spatial databases with noise. In Proceedings of KDD, 1996.\n\n[10] C. Fraley and A. E. Raftery. How many clusters? which clustering method? answers via\n\nmodel-based cluster analysis. The Computer Journal, 41(8), 1998.\n\n[11] J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal MAP inference for determinantal\n\npoint processes. In Proceedings of NIPS, 2012.\n\n[12] D. Slate.\n\nhttp://archive.ics.uci.edu/ml/\n\nLetter\n\nrecognition data set.\n\ndatasets/Letter+Recognition, 1991.\n\n[13] G. H\u00b4ebrail and A. B\u00b4erard.\n\nIndividual household electric power consumption data set.\n\nhttp://archive.ics.uci.edu/ml/datasets/Individual+household+\nelectric+power+consumption, 2012.\n\n[14] C. Goutte, L. K. Hansen, M. G. Liptrot, and E. Rostrup. Feature-space clustering for fMRI\n\nmeta-analysis. Human Brain Mapping, 13, 2001.\n\n9\n\n\f", "award": [], "sourceid": 1112, "authors": [{"given_name": "Byungkon", "family_name": "Kang", "institution": "Samsung Electronics"}]}