{"title": "q-means: A quantum algorithm for unsupervised machine learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4134, "page_last": 4144, "abstract": "Quantum information is a promising new paradigm for fast computations that can provide substantial speedups for many algorithms we use today. Among them, quantum machine learning is one of the most exciting applications of quantum computers. In this paper, we introduce q-means, a new quantum algorithm for clustering. It is a quantum version of a robust k-means algorithm, with similar convergence and precision guarantees. We also design a method to pick the initial centroids equivalent to the classical k-means++ method. Our algorithm currently provides an exponential speedup in the number of points of the dataset, compared to the classical k-means algorithm. We also detail the running time of q-means when applied to well-clusterable datasets. We provide a detailed runtime analysis and numerical simulations for specific datasets. Along with the algorithm, the theorems and tools introduced in this paper can be reused for various applications in quantum machine learning.", "full_text": "q-means: A quantum algorithm for unsupervised machine learning\n\nIordanis Kerenidis\nCNRS, IRIF, Universit\u00e9 Paris Diderot, Paris, France\njkeren@irif.fr\n\nJonas Landman\nCNRS, IRIF, Universit\u00e9 Paris Diderot, Paris, France; Ecole Polytechnique, Palaiseau, France\nlandman@irif.fr\n\nAlessandro Luongo\nCNRS, IRIF, Universit\u00e9 Paris Diderot, Paris, France; Atos Quantum Lab, Les Clayes-sous-Bois, France\naluongo@irif.fr\n\nAnupam Prakash\nCNRS, IRIF, Universit\u00e9 Paris Diderot, Paris, France\nanupam.prakash@irif.fr\n\nAbstract\n\nQuantum information is a promising new paradigm for fast computations that can provide substantial speedups for many algorithms we use today. 
Among them, quantum machine learning is one of the most exciting applications of quantum computers. In this paper, we introduce q-means, a new quantum algorithm for clustering. It is a quantum version of a robust k-means algorithm, with similar convergence and precision guarantees. We also design a method to pick the initial centroids equivalent to the classical k-means++ method. Our algorithm currently provides an exponential speedup in the number of points of the dataset, compared to the classical k-means algorithm. We also detail the running time of q-means when applied to well-clusterable datasets. We provide a detailed runtime analysis and numerical simulations for specific datasets. Along with the algorithm, the theorems and tools introduced in this paper can be reused for various applications in quantum machine learning.\n\n1 Introduction\n\nAs the amount of data generated in our society is expected to grow faster than the growth in our computational capabilities, more powerful ways of processing information are needed. Quantum computation uses the fundamental properties of quantum physics to redefine the way computers create and manipulate information. These properties imply a radically new way of computing, using qubits instead of bits, and give the possibility of obtaining quantum algorithms that could be substantially faster than classical algorithms. In recent years, there have been proposals for quantum machine learning algorithms that have the potential to offer considerable speedups over the corresponding classical algorithms, either exponential or large polynomial speedups [28, 23, 22, 8, 27, 3]. 
Of course, in order to translate such theoretical results into advantages for real-world use cases one would need both more advanced quantum hardware, which might still be years away, and a close collaboration between the classical and quantum machine learning communities in order to better understand when and how quantum algorithms can be used as a powerful tool within the larger machine learning framework. In most of these quantum machine learning applications, there are some common algorithmic primitives that are used to build the algorithms. First, quantum procedures for linear algebra (matrix multiplication, inversion, and projections in sub-eigenspaces of matrices) have been used for recommendation systems or dimensionality reduction techniques [23, 21, 28]. Second, the ability to estimate distances between quantum states, for example through the SWAP test, has been used for supervised or unsupervised learning [27, 36]. We note that most of these procedures can be used either with quantum data or they need quantum access to the classical data, which can be achieved by storing the data in specific data structures like a QRAM (Quantum Random Access Memory).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we propose q-means, a quantum algorithm for clustering, which can be viewed as a quantum analogue to the classical k-means algorithm. Since quantum computation is not deterministic and is also prone to noise, quantum machine learning has to incorporate a certain level of randomness. Therefore it is more precise to present q-means as a quantum equivalent of the \u03b4-k-means algorithm, which is a version of k-means with noise, introduced in this paper. 
We provide an analysis to show that the output of q-means is consistent with the classical \u03b4-k-means algorithm and further that the running time depends poly-logarithmically on the number of elements in the dataset. The simplest version of the k-means algorithm runs in time O(ndtk), where n is the number of elements in the training set, d is the number of features, t is the number of iterations, and k is the number of classes. Previous work in quantum clustering exists [27, 31]. In our opinion, the biggest limitation of previous works is that they do not allow one to retrieve the classical fitted model out of the computation (i.e., in our case, the k centroids). Quantum-inspired classical algorithms for clustering exist [20, 9]. These achieve polylogarithmic scaling in terms of the number of elements in the dataset, but are of a much higher polynomial degree with respect to other parameters (like condition number, rank and error). Algorithms based on these techniques have been implemented and benchmarked on real datasets with discouraging results [5], therefore we do not expect these results to change the set of problems where quantum computers are expected to have an advantage over classical computation. A complete review of previous works is presented in Supplementary Material, Section A.1.\n\n1.1 The k-means and \u03b4-k-means Algorithms\n\nThe input for the k-means algorithm [29] is a dataset V of vectors vi \u2208 R^d for i \u2208 [N]. These points must be partitioned in k subsets according to a similarity measure, e.g. the Euclidean distance. The output of the k-means algorithm is a list of k cluster centers, which are called centroids. At iteration t, we denote the k clusters by the sets C_j^t for j \u2208 [k], and each corresponding centroid by the vector c_j^t. Each data point vi is assigned to one cluster C_j^t. Let d(vi, c_j^t) be the Euclidean distance between the vectors vi and c_j^t. 
The algorithm starts by selecting k initial centroids and then alternates between two steps: (i) assign each vi a label \u2113(vi)^t corresponding to the closest centroid, that is \u2113(vi)^t = argmin_{j\u2208[k]} d(vi, c_j^t); (ii) update the centroids with the following rule: c_j^{t+1} = (1/|C_j^t|) \u2211_{vi\u2208C_j^t} vi. We say that we have converged if for a small threshold \u03c4 we have (1/k) \u2211_{j=1}^{k} d(c_j^t, c_j^{t-1}) \u2264 \u03c4. We now introduce \u03b4-k-means, which can be thought of as a robust version of the k-means algorithm in which we introduce some noise parametrized by \u03b4 > 0. The noise affects the algorithm in both steps of the k-means algorithm: label assignment and centroid estimation. As we will see in this work, q-means is the quantum analog of \u03b4-k-means, due to the noise and non-deterministic character of quantum computations. Let c*_i be the closest centroid to the data point vi. In the assignment step, instead of choosing deterministically the label corresponding to the closest centroid, one label is randomly assigned among the following set:\n\nL_\u03b4(vi) = { p : |d^2(c*_i, vi) \u2212 d^2(c_p, vi)| \u2264 \u03b4 }    (1)\n\nSecond, we add \u03b4/2 noise during the calculation of the centroid. Let C_j^{t+1} be the set of points which have been labeled by j in the previous step. For \u03b4-k-means we pick a centroid c_j^{t+1} with the property d(c_j^{t+1}, (1/|C_j^{t+1}|) \u2211_{vi\u2208C_j^{t+1}} vi) < \u03b4/2. We simulate this by adding small Gaussian noise to the centroid.\n\nLet us add two remarks on the \u03b4-k-means. First, for a well-clusterable dataset (see Section 1.4) and for a small \u03b4, the number of vectors on the boundary that risk being misclassified in each step, that is the vectors for which |L_\u03b4(vi)| > 1, is typically much smaller compared to the number of vectors that are close to a unique centroid. 
Second, we also increase by \u03b4/2 the convergence threshold of the k-means algorithm. All in all, the \u03b4-k-means algorithm finds a clustering that is robust when the data points and the centroids are perturbed with noise of magnitude O(\u03b4). Our numerical simulations show that the performance of the \u03b4-k-means algorithm is similar to the k-means algorithm for small enough \u03b4's.\n\n1.2 Quantum Preliminaries\n\nWe assume a basic understanding of quantum computing; we recommend Nielsen and Chuang [30] for an introduction to the subject. A vector v \u2208 R^d is encoded into a quantum state |v\u27e9 defined as |v\u27e9 = (1/\u2016v\u2016) \u2211_{m\u2208[d]} v_m |m\u27e9, where |m\u27e9 represents e_m, the mth vector in the standard basis. A quantum circuit or algorithm consists of unitary logic gates or measurements, and can be applied to a superposition of quantum states. We will assume at some steps that the data matrices V (datapoints) and C^t (centroids at step t) are stored in suitable QRAM data structures which are described in [23]. Important quantum subroutines and theorems for this work are described in Supplementary Material, Section A.3.\n\n1.3 Our Results\n\nWe define and analyse a new quantum algorithm for clustering, the q-means algorithm, whose running time provides substantial savings, especially for the case of large data sets, and whose performance is similar to that of the classical \u03b4-k-means algorithm - a robust version of the k-means algorithm we defined in this work - meaning that with high probability the clusters that the q-means algorithm outputs are also possible outputs for the \u03b4-k-means.\n\nThe q-means algorithm combines most of the advantages that quantum machine learning algorithms can offer for clustering. First, the running time is poly-logarithmic in the number of elements of the dataset and depends only linearly on the dimension of the feature space. 
Second, q-means returns explicit classical descriptions of the cluster centroids that are obtained by the \u03b4-k-means algorithm. As the algorithm outputs a classical description of the centroids, it is possible to use them in further (classical or quantum) algorithms, unlike previous work on quantum k-means [27] that outputs quantum states corresponding to the centroids. We start by providing a worst case analysis of the running time of each step of our algorithm. The running time parameters include the maximum norm of the dataset, the condition number and a parameter \u00b5 of the data point matrix (see definition in Theorem 3.1). While different than the classical case, these aspects are common in quantum computing [22], where the magnitude or the rank of the data point matrix can impact the efficiency of the algorithm itself. Note that with \u00d5 we hide polylogarithmic factors.\n\nResult 1. Given a dataset V \u2208 R^{N\u00d7d} stored in QRAM, the q-means algorithm outputs with high probability centroids c_1, ..., c_k that are consistent with an output of the \u03b4-k-means algorithm in time \u00d5( kd (\u03b7/\u03b4^2) \u03ba(V)(\u00b5(V) + k\u03b7/\u03b4) + k^2 (\u03b7^{1.5}/\u03b4^2) \u03ba(V)\u00b5(V) ) per iteration, where \u03ba(V) is the condition number, \u00b5(V) is a parameter that appears in quantum linear algebra procedures and 1 \u2264 \u2016vi\u2016^2 \u2264 \u03b7.\n\nWe also provide a specific running time analysis for a natural notion of well-clusterable datasets, given in the following section. See Theorem 3.2 for a formal proof.\n\nResult 2. Given a well-clusterable dataset V \u2208 R^{N\u00d7d} stored in QRAM, the q-means algorithm outputs with high probability k centroids c_1, ..., c_k that are consistent with the output of the \u03b4-k-means algorithm in time \u00d5( k^2 d (\u03b7^{2.5}/\u03b4^3) + k^{2.5} (\u03b7^2/\u03b4^3) ) per iteration, where 1 \u2264 \u2016vi\u2016^2 \u2264 \u03b7.\n\nThe parameter \u03b4 (which plays the same role as in the \u03b4-k-means) is expected to be a large enough constant that depends on the data, and the parameter \u03b7 is again expected to be a small constant for datasets whose data points have roughly the same norm. At a high level, our algorithm is quadratic in the number of clusters, linear in the dimension of the points and only polylogarithmic in the number of data points. We present extensive simulations for different datasets and find that the number of iterations is practically the same as in the k-means, and the \u03b4-k-means algorithm achieves an accuracy similar to the k-means algorithm, see Section 4.\n\n1.4 Modelling Well-Clusterable Datasets\n\nWithout loss of generality we consider in the remainder of the paper that the dataset V is normalized so that for all i \u2208 [N], we have 1 \u2264 \u2016vi\u2016, and we define the parameter \u03b7 = max_i \u2016vi\u2016^2. We will also assume that the number k is the \u201cright\u201d number of clusters, meaning that we assume each cluster has at least some \u2126(N/k) data points. We now propose a simple notion of a well-clusterable dataset. The definition aims to capture some properties that we can expect from datasets that can be clustered efficiently using a k-means algorithm. 
Note that we do not need this assumption for our general q-means algorithm, but in this model we can provide tighter bounds for its running time. Our definition of a well-clusterable dataset shares some similarity with the models of [12], [25], but it remains specific to our current problem.\n\nDefinition 1 (Well-clusterable dataset). A data matrix V \u2208 R^{N\u00d7d} with rows vi \u2208 R^d, i \u2208 [N], is said to be well-clusterable if there exist constants \u03be, \u03b2 > 0, \u03bb \u2208 [0, 1], \u03b7 \u2264 1, and cluster centroids ci for i \u2208 [k] such that:\n- (separation of cluster centroids): d(ci, cj) \u2265 \u03be \u2200i, j \u2208 [k]\n- (proximity to cluster centroid): At least \u03bbN points vi in the dataset satisfy d(vi, c_{l(vi)}) \u2264 \u03b2, where c_{l(vi)} is the centroid nearest to vi.\n- (intra-cluster smaller than inter-cluster square distances): The following inequality is satisfied: 4\u221a\u03b7 \u221a(\u03bb\u03b2^2 + (1 \u2212 \u03bb)4\u03b7) \u2264 \u03be^2 \u2212 2\u221a\u03b7 \u03b2.\n\nIntuitively, the assumptions guarantee that most of the data can be easily assigned to one of the k clusters, since these points are close to the centroids, and the centroids are sufficiently far from each other. The exact inequality comes from the error analysis, but in spirit it says that the intra-cluster distance must be sufficiently smaller than the inter-cluster distance. 
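Definition 1 can be checked numerically for a candidate clustering. A minimal numpy sketch; the toy blobs and the constants beta, xi, and the resulting lambda below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: k = 3 tight Gaussian blobs around well-separated centroids,
# placed so that 1 <= ||v_i|| for all i (the paper's normalization).
centroids = np.array([[0.0, 1.5], [1.5, 0.0], [-1.5, -1.5]])
labels = rng.integers(0, 3, size=300)
V = centroids[labels] + 0.05 * rng.normal(size=(300, 2))

eta = np.max(np.sum(V ** 2, axis=1))        # eta = max_i ||v_i||^2

# Condition 1 (separation): xi = smallest pairwise centroid distance.
xi = min(np.linalg.norm(centroids[i] - centroids[j])
         for i in range(3) for j in range(i + 1, 3))

# Condition 2 (proximity): fraction lambda of points within beta of their centroid.
beta = 0.3                                   # illustrative choice
dists = np.linalg.norm(V - centroids[labels], axis=1)
lam = np.mean(dists <= beta)

# Condition 3: 4*sqrt(eta)*sqrt(lam*beta^2 + (1-lam)*4*eta) <= xi^2 - 2*sqrt(eta)*beta
lhs = 4 * np.sqrt(eta) * np.sqrt(lam * beta ** 2 + (1 - lam) * 4 * eta)
rhs = xi ** 2 - 2 * np.sqrt(eta) * beta
print(lam, lhs <= rhs)
```

For blobs this tight, essentially all points fall within beta of their centroid, so the third condition reduces to comparing 4*sqrt(eta)*beta against xi^2 - 2*sqrt(eta)*beta.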
A series of four claims, detailed in the Supplementary Material, Section A.2, provides mathematical properties of well-clusterable datasets that will be used in the proofs on the running time of the q-means applied to well-clusterable datasets.\n\nAn overview of the q-means algorithm is given as Algorithm 1.\n\n2 The q-means Algorithm\n\nAt a high level, the q-means algorithm follows the same steps as the classical k-means algorithm, where we now use quantum subroutines for distance estimation, finding the minimum value among a set of elements, matrix multiplication for obtaining the new centroids as quantum states, and efficient tomography. First, we pick k random centroids, or we use our initialization procedure q-means++, an efficient quantum equivalent of k-means++ (see Section 2.1). Then, in Steps 1 and 2 all data points are assigned to clusters in superposition and not one after the other, and in Steps 3 and 4 we update the centroids of the clusters. The process is repeated until convergence.\n\nStep 1: Centroid Distance Estimation. The first step of the algorithm estimates the square distance between all data points and centroids using a quantum procedure. From |i\u27e9|j\u27e9|0\u27e9, we create the state |i\u27e9|j\u27e9( (1/2)|0\u27e9(|vi\u27e9 + |cj\u27e9) + (1/2)|1\u27e9(|vi\u27e9 \u2212 |cj\u27e9) ). Since the probability of measuring 1 on the third register is proportional to \u27e8vi|cj\u27e9, we can use the Amplitude Estimation circuit to extract this value in another quantum register. Several copies of this register can be taken to compute a median estimation that boosts the probability of success. More details are provided in the Supplementary Material, Section A.4.1. 
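The measurement statistics behind Step 1 can be reproduced classically: for unit vectors, the control qubit of the state (1/2)|0>(|v>+|c>) + (1/2)|1>(|v>-|c>) measures 1 with probability ||v-c||^2/4 = (1 - <v|c>)/2. A small numpy sketch that samples this Born probability (a classical simulation of the statistics only, not of the circuit or of Amplitude Estimation):

```python
import numpy as np

rng = np.random.default_rng(1)

def estimated_sq_distance(v, c, shots=200_000):
    """Simulate Step 1's measurement statistics for unit vectors v, c.

    The state (1/2)|0>(|v>+|c>) + (1/2)|1>(|v>-|c>) yields outcome 1 with
    probability P(1) = ||v - c||^2 / 4 = (1 - <v|c>) / 2, so 4 * P(1)
    estimates the squared distance between the normalized vectors.
    """
    p1 = np.linalg.norm(v - c) ** 2 / 4      # exact Born probability
    ones = rng.binomial(shots, p1)           # finite-sample measurement
    return 4 * ones / shots

v = np.array([3.0, 4.0]); v /= np.linalg.norm(v)
c = np.array([1.0, 1.0]); c /= np.linalg.norm(c)

est = estimated_sq_distance(v, c)
exact = np.linalg.norm(v - c) ** 2
print(abs(est - exact))  # shrinks as shots grow
```

For the unnormalized vectors of the algorithm, this estimate is then rescaled by the stored norms ||vi|| ||cj||, as described next.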
The distance estimation becomes very efficient when we have quantum access to the vectors and the centroids, by querying the state preparation oracles of the QRAM, |i\u27e9|0\u27e9 \u21a6 |i\u27e9|vi\u27e9 and |j\u27e9|0\u27e9 \u21a6 |j\u27e9|cj\u27e9, in time T = O(log nd), as well as querying the norms of these vectors. As quantum states have unit norm, we need to multiply their inner products by the real norms \u2016vi\u2016\u2016cj\u2016. If we have an absolute error \u03b5 for the square distance estimation of the normalized vectors, then the final error is of the order of \u03b5\u2016vi\u2016\u2016cj\u2016. These computations are performed in superposition over all point indices |i\u27e9 and for a tensor product of all centroid indices |j\u27e9 at the same time. This leads to the distance estimation theorem corresponding to Step 1 of the q-means algorithm. We develop its proof in the Supplementary Material, Section A.4.1.\n\nTheorem 2.1 (Centroid Distance estimation). Let a data matrix V \u2208 R^{N\u00d7d} and a centroid matrix C \u2208 R^{k\u00d7d} be stored in QRAM, such that the unitaries |i\u27e9|0\u27e9 \u21a6 |i\u27e9|vi\u27e9 and |j\u27e9|0\u27e9 \u21a6 |j\u27e9|cj\u27e9 can be performed in time O(log(Nd)) and the norms of the vectors are known. For any \u2206 > 0 and \u03b5_1 > 0, there exists a quantum algorithm that, given the state (1/\u221aN) \u2211_{i=1}^{N} |i\u27e9 \u2297_{j\u2208[k]} (|j\u27e9|0\u27e9), performs the mapping to\n\n(1/\u221aN) \u2211_{i=1}^{N} |i\u27e9 \u2297_{j\u2208[k]} (|j\u27e9|d^2(vi, cj)\u27e9),    (2)\n\nwhere |d^2(vi, cj) \u2212 d^2(vi, cj)| \u2264 \u03b5_1 with probability at least 1 \u2212 2\u2206, in time \u00d5( k\u03b7 log(\u2206^{-1}) / \u03b5_1 ), where \u03b7 = max_i(\u2016vi\u2016^2).\n\nStep 2: Cluster Assignment. At the end of Step 1, we have coherently estimated the square distance between each point in the dataset and the k centroids in separate registers. We can now select the index j that corresponds to the centroid closest to the given data point, written as \u2113(vi) = argmin_{j\u2208[k]} d(vi, cj). As the square is a monotone function, we do not need to compute the square root of the distance in order to find \u2113(vi).\n\nLemma 2.2 (Circuit for finding the minimum). Given k different log p-bit registers \u2297_{j\u2208[k]} |a_j\u27e9, there is a quantum circuit U_min that maps (\u2297_{j\u2208[k]} |a_j\u27e9)|0\u27e9 \u2192 (\u2297_{j\u2208[k]} |a_j\u27e9)|argmin(a_j)\u27e9 in time O(k log p).\n\nProof. We append an additional register for the result that is initialized to |1\u27e9. We then repeat the following operation for 2 \u2264 j \u2264 k: we compare registers 1 and j, and if the value in register j is smaller we swap registers 1 and j and update the result register to j. The cost of the procedure is O(k log p).\n\nThe cost of finding the minimum is \u00d5(k) in Step 2 of the q-means algorithm, while we also need to uncompute the distances by repeating Step 1. 
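Classically, the comparator cascade in the proof of Lemma 2.2 is a linear argmin scan over the k registers; a sketch of that loop (plain Python, using the paper's 1-based register numbering):

```python
from typing import Sequence

def argmin_registers(a: Sequence[int]) -> int:
    """Classical analogue of the circuit U_min from Lemma 2.2.

    Register 1 holds the current minimum; a result register holds its
    index, initialized to |1>. For j = 2, ..., k we compare register 1
    with register j and conditionally swap, updating the result register.
    """
    best_val, best_idx = a[0], 1              # result register starts at |1>
    for j in range(2, len(a) + 1):            # j = 2, ..., k
        if a[j - 1] < best_val:               # compare two log(p)-bit registers
            best_val, best_idx = a[j - 1], j  # swap + update result register
        # each comparison and conditional swap costs O(log p) gates
    return best_idx

print(argmin_registers([9, 4, 7, 2, 8]))  # -> 4 (1-based index of value 2)
```

The quantum circuit performs exactly this scan reversibly, which is why its cost is O(k log p) rather than, say, a tree of depth log k.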
Once we apply the minimum finding Lemma 2.2 and undo the computation we obtain the state\n\n|\u03c8^t\u27e9 := (1/\u221aN) \u2211_{i=1}^{N} |i\u27e9 |\u2113^t(vi)\u27e9.    (3)\n\nAt a high level, Steps 1 and 2 have assigned labels to all data points in superposition. Note that this state does not allow us to read out all possible labels, but it contains exactly the information we need in order to estimate the new centroids in the following step.\n\nStep 3: Centroid state creation. The previous step gave us the state |\u03c8^t\u27e9 = (1/\u221aN) \u2211_{i=1}^{N} |i\u27e9|\u2113^t(vi)\u27e9. The first register of this state stores the index of the data points while the second register stores the label for the data point in the current iteration. Given these states, we need to find the new centroids |c_j^{t+1}\u27e9, which are the barycenters of the data points having the same label. Let \u03c7_j^t \u2208 R^N be the characteristic vector for cluster j \u2208 [k] at iteration t, scaled to unit \u2113_1 norm, that is (\u03c7_j^t)_i = 1/|C_j^t| if i \u2208 C_j and 0 if i \u2209 C_j. The creation of the quantum states corresponding to the centroids is based on the following simple claim.\n\nClaim 2.3. Let \u03c7_j^t \u2208 R^N be the scaled characteristic vector for C_j at iteration t and V \u2208 R^{N\u00d7d} be the data matrix, then c_j^{t+1} = V^T \u03c7_j^t.\n\nThe above claim allows us to compute the updated centroids c_j^{t+1} using quantum linear algebra operations. In fact, the state |\u03c8^t\u27e9 can be written as a weighted superposition of the characteristic vectors of the clusters:\n\n|\u03c8^t\u27e9 = \u2211_{j=1}^{k} \u221a(|C_j|/N) ( (1/\u221a|C_j|) \u2211_{i\u2208C_j} |i\u27e9 ) |j\u27e9 = \u2211_{j=1}^{k} \u221a(|C_j|/N) |\u03c7_j^t\u27e9 |j\u27e9    (4)\n\nWe can then measure the label register |j\u27e9. 
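Claim 2.3 can be sanity-checked numerically: with chi_j^t scaled to unit l1 norm, the product V^T chi_j^t is exactly the barycenter of cluster j. A small numpy check on hypothetical data and labels:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, k = 12, 3, 2
V = rng.normal(size=(N, d))                 # data matrix, rows v_i
labels = rng.integers(0, k, size=N)         # cluster assignment l(v_i)
labels[0] = 0                               # ensure cluster 0 is non-empty

j = 0
members = np.flatnonzero(labels == j)       # indices in C_j

# Characteristic vector of cluster j, scaled to unit l1 norm:
# (chi_j)_i = 1/|C_j| if i in C_j, else 0.
chi = np.zeros(N)
chi[members] = 1.0 / len(members)

centroid_via_claim = V.T @ chi              # c_j^{t+1} = V^T chi_j^t
centroid_direct = V[members].mean(axis=0)   # barycenter of cluster j

print(np.allclose(centroid_via_claim, centroid_direct))  # -> True
```

This identity is what lets the algorithm obtain the new centroid states with a quantum matrix multiplication by V^T.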
The running time of this step is derived from Theorem A.8 in the Supplementary Material, where the time to prepare the state |\u03c7_j^t\u27e9 is the time of Steps 1 and 2. Note that we do not have to add an extra k factor due to the sampling, since we can run the matrix multiplication procedures in parallel for all j, so that every time we measure a random |\u03c7_j^t\u27e9 we perform one more step of the corresponding matrix multiplication. Assuming that all clusters have size \u2126(N/k), we will have an extra factor of O(log k) in the running time by a standard coupon collector argument.\n\nStep 4: Centroid update. In Step 4, we need to go from the quantum states corresponding to the centroids to a classical description of the centroids in order to perform the update step. For this, we will apply the vector state tomography algorithm, stated in Theorem A.9 in the Supplementary Material, on the states |c_j^{t+1}\u27e9 that we create in Step 3. Note that for each j \u2208 [k] we will need to invoke the unitary that creates the states |c_j^{t+1}\u27e9 a total of O(d log d / \u03b5_4^2) times for achieving \u2016|c_j\u27e9 \u2212 |c_j\u27e9\u2016 < \u03b5_4. Hence, for performing the tomography of all clusters, we will invoke the unitary O(k (log k) d (log d) / \u03b5_4^2) times, where the O(k log k) term is the time to get a copy of each centroid state. The vector state tomography gives us a classical estimate of the unit norm centroids within error \u03b5_4, that is \u2016|c_j\u27e9 \u2212 |c_j\u27e9\u2016 < \u03b5_4. Using the approximation of the norms \u2016c_j\u2016 with relative error \u03b5_3 from Step 3, we can combine these estimates to recover the centroids as vectors. The analysis is described in the following claim, whose proof can be found in the Supplementary Material:\n\nClaim 2.4. 
Let \u03b5_4 be the error we commit in estimating |c_j\u27e9, such that \u2016|c_j\u27e9 \u2212 |c_j\u27e9\u2016 < \u03b5_4, and \u03b5_3 the error we commit in estimating the norms, |\u2016c_j\u2016 \u2212 \u2016c_j\u2016| \u2264 \u03b5_3 \u2016c_j\u2016. Then \u2016c_j \u2212 c_j\u2016 \u2264 \u221a\u03b7 (\u03b5_3 + \u03b5_4) = \u03b5_centroid.\n\n2.1 Initialization: q-means++\n\nThe k-means++ technique [6] is frequently used for initializing the classical k-means algorithm. The first centroid is chosen uniformly at random. We sample the next centroid from a probability distribution where the probability of sampling vi is proportional to the squared distance to the closest centroid. We add the sampled point to the list of the already chosen centroids, and repeat the procedure until k centroids have been chosen. Note that when more than one centroid has already been picked, the sampling probability is proportional to the squared distance to the closest centroid. In the Supplementary Material (Section A.5) we prove the following theorem:\n\nTheorem 2.5. Let the data matrix V \u2208 R^{N\u00d7d} be stored in the QRAM. There exists a quantum algorithm that returns a matrix C \u2208 R^{k\u00d7d} consistent with the centroids returned by the k-means++ initialization algorithm in time\n\nO( k^2 (2\u03b7^{1.5}/\u03b5_1) \u221a(E(d^2(vi, vj))) ),    (5)\n\nwhere E(d^2(vi, vj)) is the average squared distance between two points of the dataset.\n\n3 Analysis\n\nWe provide our general theorem about the running time and accuracy of the q-means algorithm.\n\nTheorem 3.1 (q-means). 
For a data matrix V \u2208 R^{N\u00d7d} and a parameter \u03b4 > 0, the q-means algorithm with high probability outputs centroids consistent with the classical \u03b4-k-means algorithm, in time \u00d5( kd (\u03b7/\u03b4^2) \u03ba(V)(\u00b5(V) + k\u03b7/\u03b4) + k^2 (\u03b7^{1.5}/\u03b4^2) \u03ba(V)\u00b5(V) ) per iteration, where 1 \u2264 \u2016vi\u2016^2 \u2264 \u03b7, \u03ba(V) is the condition number, and \u00b5(V) = min_{p\u2208P} ( \u2016V\u2016_F, \u221a(s_{2p}(V) s_{2(1\u2212p)}(V^T)) ), where P \u2282 [0, 1] such that |P| = O(1) and s_p(V) := max_{i\u2208[N]} \u2016V_i\u2016_p^p.\n\nWe prove this theorem in Sections 3.1 and 3.2 and then provide the running time of the algorithm for well-clusterable datasets as Theorem 3.2.\n\nAlgorithm 1 q-means.\n\nRequire: Data matrix V \u2208 R^{N\u00d7d} stored in a QRAM data structure. Precision parameter \u03b4 for k-means, error parameter \u03b5_1 for distance estimation, \u03b5_2 and \u03b5_3 for matrix multiplication and \u03b5_4 for tomography.\nEnsure: Outputs vectors c_1, c_2, ..., c_k \u2208 R^d that correspond to the centroids at the final step of the \u03b4-k-means algorithm.\n\nSelect k initial centroids c_1^0, ..., c_k^0 and store them in the QRAM data structure.\nt = 0\nrepeat\n  Step 1: Centroid Distance Estimation. Perform the mapping (Theorem 2.1)\n\n  (1/\u221aN) \u2211_{i=1}^{N} |i\u27e9 \u2297_{j\u2208[k]} |j\u27e9|0\u27e9 \u21a6 (1/\u221aN) \u2211_{i=1}^{N} |i\u27e9 \u2297_{j\u2208[k]} |j\u27e9|d^2(vi, c_j^t)\u27e9    (6)\n\n  where |d^2(vi, c_j^t) \u2212 d^2(vi, c_j^t)| \u2264 \u03b5_1.\n  Step 2: Cluster Assignment. Find the minimum distance among {d^2(vi, c_j^t)}_{j\u2208[k]} (Lemma 2.2), then uncompute Step 1 to create the superposition of all points and their labels\n\n  (1/\u221aN) \u2211_{i=1}^{N} |i\u27e9 \u2297_{j\u2208[k]} |j\u27e9|d^2(vi, c_j^t)\u27e9 \u21a6 (1/\u221aN) \u2211_{i=1}^{N} |i\u27e9|\u2113^t(vi)\u27e9    (7)\n\n  Step 3: Centroid states creation.\n  3.1 Measure the label register to obtain a state |\u03c7_j^t\u27e9 = (1/\u221a|C_j^t|) \u2211_{i\u2208C_j^t} |i\u27e9, with probability |C_j^t|/N.\n  3.2 Perform matrix multiplication with the matrix V^T and the vector |\u03c7_j^t\u27e9 to obtain the state |c_j^{t+1}\u27e9 with error \u03b5_2, along with an estimation of \u2016c_j^{t+1}\u2016 with relative error \u03b5_3 (Theorem A.8).\n  Step 4: Centroid Update.\n  4.1 Perform tomography for the states |c_j^{t+1}\u27e9 with precision \u03b5_4 using the operation from Steps 1-3 (Theorem A.9) and get a classical estimate c_j^{t+1} for the new centroids such that |c_j^{t+1} \u2212 c_j^{t+1}| \u2264 \u221a\u03b7 (\u03b5_3 + \u03b5_4) = \u03b5_centroids.\n  4.2 Update the QRAM data structure for the centroids with the new vectors c_1^{t+1}, ..., c_k^{t+1}.\n  t = t + 1\nuntil the convergence condition is satisfied.\n\n3.1 Error analysis\n\nIn this section we determine the error parameters in the different steps of the quantum algorithm so that the quantum algorithm behaves the same as the classical \u03b4-k-means. More precisely, we will determine the values of the errors \u03b5_1, \u03b5_2, \u03b5_3, \u03b5_4 in terms of \u03b4 so that, firstly, the cluster assignment of all data points made by the q-means algorithm is consistent with a classical run of the \u03b4-k-means algorithm, and also that the centroids computed by the q-means after each iteration are again consistent with centroids that can be returned by the \u03b4-k-means algorithm. The cluster assignment in q-means happens in two steps. The first step estimates the square distances between all points and all centroids. The error in this procedure is of the form |d^2(cj, vi) \u2212 d^2(cj, vi)| < \u03b5_1 for a point vi and a centroid cj. 
The second step finds the minimum of these distances without adding any error. For the q-means to output a cluster assignment consistent with the \u03b4-k-means algorithm, we require that:\n\n\u2200j \u2208 [k], |d^2(cj, vi) \u2212 d^2(cj, vi)| \u2264 \u03b4/2    (8)\n\nwhich implies that no centroid with distance more than \u03b4 above the minimum distance can be chosen by the q-means algorithm as the label. Thus we need to take \u03b5_1 < \u03b4/2. After the cluster assignment of the q-means (which happens in superposition), we update the clusters by first performing a matrix multiplication to create the centroid states and estimate their norms, and then a tomography to get a classical description of the centroids. The error in this part is \u03b5_centroid, as defined in Claim 2.4, namely \u2016c_j \u2212 c_j\u2016 \u2264 \u03b5_centroid = \u221a\u03b7 (\u03b5_3 + \u03b5_4). Again, for ensuring that the q-means is consistent with the classical \u03b4-k-means algorithm, we take \u03b5_3 < \u03b4/(4\u221a\u03b7) and \u03b5_4 < \u03b4/(4\u221a\u03b7). Note also that we have ignored the error \u03b5_2, which we can easily deal with since it only appears in a logarithmic factor.\n\n3.2 Runtime analysis\n\nAs for the classical algorithm, the runtime of q-means depends linearly on the number of iterations, so here we analyze the cost of a single step. The cost of tomography for the k centroid vectors is O( kd log k log d / \u03b5_4^2 ) times the cost of preparation of a single centroid state |c_j^t\u27e9. A single copy of |c_j^t\u27e9 is prepared by applying the matrix multiplication by V^T procedure on the state |\u03c7_j^t\u27e9 obtained using square distance estimation. The time required for preparing a single copy of |c_j^t\u27e9 is O( \u03ba(V)(\u00b5(V) + T_\u03c7) log(1/\u03b5_2) ) by Theorem A.8, where T_\u03c7 is the time for preparing |\u03c7_j^t\u27e9. 
The time\n\n) times the cost of preparation of a single centroid state |ct\n\n\u00014\n\n2\n\n) by Theorem 2.1.\n\n\u00013\n\nThe cost of norm estimation for k different centroids is independent of the tomography cost and\n). Combining together all these costs and suppressing all the logarithmic factors\n+ k2 \u03b7\nThe analysis in\n\u00013\u00011\n\u221a\n\u03b7 and \u00014 = \u03b4\n\u03b7 . Substituting these values in\n4\n\n\u221a\nsection 3.1 shows that we can take \u00011 = \u03b4/2, \u00013 = \u03b4\n4\nthe above running time, it follows that the running time of the q-means algorithm is\n\n\u00b5(V ) + k \u03b7\n\u00011\n\n\u03ba(V )\u00b5(V )\n\nkd 1\n\u00012\n4\n\n\u03ba(V )\n\n(cid:16)\n\n(cid:17)\n\n(cid:17)\n\n(cid:17)\n\n(cid:16) k\u03b7 log(\u2206\u22121) log(N d)\n\nT\u03c7 is (cid:101)O\nis (cid:101)O( kT\u03c7\u03ba(V )\u00b5(V )\nwe have a total running time of (cid:101)O\n\n= (cid:101)O( k\u03b7\n(cid:16)\n\n\u00011\n\n\u00011\n\n(cid:18)\n\n(cid:101)O\n\n(cid:16)\n\n(cid:17)\n\nkd\n\n\u03b7\n\u03b42 \u03ba(V )\n\n\u00b5(V ) + k\n\n\u03b7\n\u03b4\n\n+ k2 \u03b71.5\n\n\u03b42 \u03ba(V )\u00b5(V )\n\n.\n\n(9)\n\n(cid:19)\n\n(cid:16)\n\n(cid:17)\n\nk2d \u03b72.5\n\n\u03b43 + k2.5 \u03b72\n\nThis completes the proof of Theorem 3.1. We next state our main result when applied to a well-\nclusterable dataset, as in Section 1.4.\nTheorem 3.2 (q-means on well-clusterable data). For a well-clusterable dataset V \u2208 RN\u00d7d stored in\nappropriate QRAM, the q-means algorithm returns with high probability the k centroids consistently\nper iteration,\n\nwith the classical \u03b4-k-means algorithm for a constant \u03b4 in time (cid:101)O\n\nfor 1 \u2264 (cid:107)vi(cid:107)2 \u2264 \u03b7.\nThe proof of this Theorem is provided in Supplementary Material, Section A.4.3. 
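Classically, the matrix multiplication step used above to prepare the centroid states has a simple linear-algebra reading: applying $V^T$ to the normalized indicator vector $\chi_j$ of cluster $j$ yields a vector proportional to the centroid of that cluster. A minimal numpy check of this identity (all variable names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 12, 4, 3
V = rng.normal(size=(N, d))            # data matrix, rows are the points v_i
labels = np.tile(np.arange(k), N // k) # cluster labels l(v_i), each cluster nonempty

j = 1
C_j = np.flatnonzero(labels == j)      # indices of cluster j
chi = np.zeros(N)
chi[C_j] = 1.0 / np.sqrt(len(C_j))     # |chi_j>: uniform superposition over C_j

# V^T chi_j = (1/sqrt(|C_j|)) * sum_{i in C_j} v_i = sqrt(|C_j|) * centroid_j
u = V.T @ chi
centroid = V[C_j].mean(axis=0)
assert np.allclose(u, np.sqrt(len(C_j)) * centroid)
```

So the quantum state $|c_j^{t+1}\rangle$ encodes the direction of the classical centroid, which is why a separate norm estimate (with error $\epsilon_3$) is needed before tomography recovers the centroid itself.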
At a high level, we use several claims to bound the parameters $\kappa(V)$ and $\mu(V)$ of well-clusterable datasets, as well as the error parameters, using the rank, singular values, and distribution properties of such datasets.

Let us make some concluding remarks regarding the running time of q-means. For datasets where the number of points is much larger than the other parameters, the running time of the q-means algorithm is an improvement over the classical k-means algorithm. For instance, for most problems in data analysis, k is typically small (< 100). The number of features satisfies d ≤ N in most situations, and it can, if needed, be reduced by first applying a quantum dimensionality reduction algorithm [21] (which has running time poly-logarithmic in d). To sum up, q-means has the same output as the classical δ-k-means algorithm (which is a robust version of k-means with similar running time and performance), it requires the same number of iterations, but has a running time only poly-logarithmic in N, giving an exponential speedup with respect to the size of the dataset.

4 Simulations on real data

We would like to demonstrate that the quantum algorithm provides accurate classification results. However, since neither quantum simulators nor quantum computers large enough to test q-means are currently available, we tested the equivalent classical implementation of δ-k-means, knowing that our quantum algorithm provides results consistent with the δ-k-means algorithm. To implement δ-k-means, we changed the assignment step of the k-means algorithm to select a random centroid among those that are δ-close to the closest centroid, and added δ/2 error to the updated clusters. We benchmarked our q-means algorithm on two datasets: the well-known MNIST dataset of handwritten digits and a synthetic dataset of Gaussian clusters.
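The modified assignment step described above can be sketched classically as follows. This is a minimal numpy sketch under our own naming (`delta_kmeans_assign`), not the authors' actual implementation: each point is assigned uniformly at random among the centroids whose squared distance is within δ of the minimum.

```python
import numpy as np

def delta_kmeans_assign(V, centroids, delta, rng):
    """delta-k-means assignment: pick uniformly at random among the
    centroids whose squared distance to the point is within delta of
    the minimum squared distance."""
    # squared distances between every point and every centroid, shape (N, k)
    d2 = ((V[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = np.empty(len(V), dtype=int)
    for i, row in enumerate(d2):
        close = np.flatnonzero(row <= row.min() + delta)  # delta-close centroids
        labels[i] = rng.choice(close)
    return labels

# With delta = 0 this reduces to the standard k-means assignment.
```

The second modification mentioned above, adding δ/2 error to the updated clusters, can then be simulated by perturbing the recomputed centroids by that amount before the next iteration.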
To measure and compare the accuracy of our clustering algorithm, we ran the k-means and the δ-k-means algorithms for different values of δ on a training dataset, and then compared the accuracy of the classification on a test set, containing data points on which the algorithms had not been trained, using a number of widely used performance measures. More experiments are provided in the Supplementary Material.

Figure 1: Accuracy evolution on the MNIST dataset under k-means and q-means (δ-k-means) for 4 different values of δ. Data has been preprocessed by a PCA to 40 dimensions. All versions converge in the same number of steps, with a drop in accuracy as δ increases. The apparent increase in the number of steps until convergence is due only to the different stopping condition for δ-k-means.

The MNIST dataset is composed of 60,000 handwritten digits as images of 28×28 pixels (784 dimensions). From this data we first performed dimensionality reduction, then normalized the data so that the minimum norm is one. Note that a quantum computer could also be used for dimensionality reduction algorithms like [28, 10]. As preprocessing, we first performed a Principal Component Analysis (PCA), retaining the data projected onto a subspace of dimension 40. After normalization, the value of η was 8.25 (maximum norm of 2.87), and the condition number of the data matrix was 4.53. Figure 1 shows the evolution of the accuracy during k-means and δ-k-means for 4 different values of δ. In this numerical experiment, we can see that for values of the parameter η/δ of order 20, both k-means and δ-k-means reached a similar accuracy in the same number of steps. Notice that the MNIST dataset, without preprocessing other than dimensionality reduction, is known not to be well-clusterable, hence the low accuracy reached.
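The preprocessing just described (PCA to 40 dimensions, rescaling so that the minimum norm is one, then reading off η and the condition number) can be reproduced with a short numpy sketch. The function name and the exact sequence of steps are our reconstruction, not the authors' code:

```python
import numpy as np

def preprocess(X, dim=40):
    """PCA to `dim` dimensions, then rescale so the minimum row norm is 1."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:dim].T                          # project onto top principal components
    Z = Z / np.linalg.norm(Z, axis=1).min()      # minimum norm becomes 1
    eta = (np.linalg.norm(Z, axis=1) ** 2).max() # eta = max_i ||v_i||^2
    s = np.linalg.svd(Z, compute_uv=False)
    kappa = s[0] / s[-1]                         # condition number of the data matrix
    return Z, eta, kappa
```

On MNIST, the paper reports η = 8.25 and condition number 4.53 after this kind of preprocessing; the exact values depend on normalization details.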
More experimental details are provided in the Supplementary Material, Section A.7.

Conclusions. In our experiments, the values of η/δ remained between 3 and 20. Moreover, the parameter η = max_i ‖v_i‖² provides a worst-case guarantee for the algorithm. One can expect that the running time in practice will scale with the average squared norm of the points. For the MNIST dataset after PCA, this value is 2.65 whereas η = 8.3. Our simulations show that the convergence rate of δ-k-means is almost the same as that of the regular k-means algorithm, even for fairly large δ. This provides evidence that the q-means algorithm will have as good performance as the classical k-means, with a running time that is significantly lower than that of the classical algorithm for large datasets.

References

[1] Dimitris Achlioptas and Frank McSherry. Fast computation of low rank matrix approximations. In Proceedings of the 33rd Annual Symposium on Theory of Computing, pages 611–618, 2001.

[2] Esma Aïmeur, Gilles Brassard, and Sébastien Gambs. Quantum speed-up for unsupervised learning. Machine Learning, 90(2):261–287, 2013.

[3] Jonathan Allcock, Chang-Yu Hsieh, Iordanis Kerenidis, and Shengyu Zhang. Quantum algorithms for feedforward neural networks. arXiv preprint arXiv:1812.03089, 2018.

[4] Andris Ambainis. Variable time amplitude amplification and quantum algorithms for linear algebra problems. In STACS'12 (29th Symposium on Theoretical Aspects of Computer Science), volume 14, pages 636–647. LIPIcs, 2012.

[5] Juan Miguel Arrazola, Alain Delgado, Bhaskar Roy Bardhan, and Seth Lloyd. Quantum-inspired algorithms in practice. arXiv preprint arXiv:1905.10415, 2019.

[6] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035.
Society for Industrial and Applied Mathematics, 2007.

[7] Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53–74, 2002.

[8] Shantanav Chakraborty, András Gilyén, and Stacey Jeffery. The power of block-encoded matrix powers: improved regression techniques via faster Hamiltonian simulation. arXiv preprint arXiv:1804.01973, 2018.

[9] Nai-Hui Chia, András Gilyén, Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. arXiv preprint arXiv:1910.06151, 2019.

[10] Iris Cong and Luming Duan. Quantum discriminant analysis for dimensionality reduction and classification. arXiv preprint arXiv:1510.00113, 2015.

[11] Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, and V Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9–33, 2004.

[12] Petros Drineas, Iordanis Kerenidis, and Prabhakar Raghavan. Competitive recommendation systems. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 82–90. ACM, 2002.

[13] Christoph Durr and Peter Hoyer. A quantum algorithm for finding the minimum. arXiv preprint quant-ph/9607014, 1996.

[14] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm. arXiv preprint arXiv:1411.4028, 2014.

[15] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

[16] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004.

[17] András Gilyén, Seth Lloyd, and Ewin Tang.
Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension. arXiv preprint arXiv:1811.04909, 2018.

[18] András Gilyén, Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. arXiv preprint arXiv:1806.01838, 2018.

[19] Aram W Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Physical Review Letters, 103(15):150502, 2009.

[20] Dhawal Jethwani, François Le Gall, and Sanjay K Singh. Quantum-inspired classical algorithms for singular value transformation. arXiv preprint arXiv:1910.05699, 2019.

[21] Iordanis Kerenidis and Alessandro Luongo. Quantum classification of the MNIST dataset via slow feature analysis. arXiv preprint arXiv:1805.08837, 2018.

[22] Iordanis Kerenidis and Anupam Prakash. Quantum gradient descent for linear systems and least squares. arXiv:1704.04992, 2017.

[23] Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. Proceedings of the 8th Innovations in Theoretical Computer Science Conference, 2017.

[24] Iordanis Kerenidis and Anupam Prakash. A quantum interior point method for LPs and SDPs. arXiv:1808.09266, 2018.

[25] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 299–308. IEEE, 2010.

[26] Robert Layton. A demo of k-means clustering on the handwritten digits data, 1999. https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html.

[27] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. arXiv, 1307.0411:1–11, 7 2013.

[28] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost.
Quantum principal component analysis. Nature Physics, 10(9):631, 2014.

[29] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[30] Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002.

[31] JS Otterbach, R Manenti, N Alidoust, A Bestwick, M Block, B Bloom, S Caldwell, N Didier, E Schuyler Fried, S Hong, et al. Unsupervised machine learning on a hybrid quantum computer. arXiv preprint arXiv:1712.05771, 2017.

[33] Amnon Ta-Shma. Inverting well conditioned matrices in quantum logspace. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 881–890. ACM, 2013.

[34] Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. arXiv preprint arXiv:1807.04271, 2018.

[35] Ewin Tang. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. arXiv preprint arXiv:1811.00414, 2018.

[36] Nathan Wiebe, Ashish Kapoor, and Krysta M Svore. Quantum algorithms for nearest-neighbor methods for supervised and unsupervised learning. Quantum Information & Computation, 15(3-4):316–356, 2015.