{"title": "Making AI Forget You: Data Deletion in Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3518, "page_last": 3531, "abstract": "Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used --- the EU\u2019s Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of $k$-means clustering, we propose two provably deletion efficient algorithms which achieve an average of over $100\\times$ improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical $k$-means++ baseline.", "full_text": "Making AI Forget You:\n\nData Deletion in Machine Learning\n\nAntonio A. Ginart1, Melody Y. Guan2, Gregory Valiant2, and James Zou3\n\n1Dept. of Electrical Engineering\n\n2Dept. of Computer Science\n\n3Dept. of Biomedial Data Science\n\nStanford University, Palo Alto, CA 94305\n\n{tginart, mguan, valiant, jamesz}@stanford.edu\n\nAbstract\n\nIntense recent discussions have focused on how to provide individuals with control\nover when their data can and cannot be used \u2014 the EU\u2019s Right To Be Forgotten\nregulation is an example of this effort. In this paper we initiate a framework studying\nwhat to do when it is no longer permissible to deploy models derivative from speci\ufb01c\nuser data. 
In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably efficient deletion algorithms which achieve an average of over 100× improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.

1 Introduction

Recently, one of the authors received the redacted email below, informing us that an individual's data cannot be used any longer. The UK Biobank [79] is one of the most valuable collections of genetic and medical records, with half a million participants. Thousands of machine learning classifiers are trained on this data, and thousands of papers have been published using this data.

EMAIL –– UK BIOBANK ––
Subject: UK Biobank Application [REDACTED], Participant Withdrawal Notification [REDACTED]

Dear Researcher,

As you are aware, participants are free to withdraw from the UK Biobank at any time and request that their data no longer be used. Since our last review, some participants involved with Application [REDACTED] have requested that their data should no longer be used.

The email request from the UK Biobank illustrates a fundamental challenge the broad data science and policy community is grappling with: how should we provide individuals with flexible control over how corporations, governments, and researchers use their data? Individuals could decide at any time that they do not wish for their personal data to be used for a particular purpose by a particular entity.
This ability is sometimes legally enforced. For example, the European Union's General Data Protection Regulation (GDPR) and former Right to Be Forgotten [24, 23] both require that companies and organizations enable users to withdraw consent to their data at any time under certain circumstances. These regulations broadly affect international companies and technology platforms with EU customers and users. Legal scholars have pointed out that the continued use of AI systems directly trained on deleted data could be considered illegal under certain interpretations, and ultimately concluded that: it may be impossible to fulfill the legal aims of the Right to be Forgotten in artificial intelligence environments [86]. Furthermore, so-called model-inversion attacks have demonstrated the capability of adversaries to extract user information from trained ML models [85].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Concretely, we frame the problem of data deletion in machine learning as follows. Suppose a statistical model is trained on n datapoints. For example, the model could be trained to perform disease diagnosis from data collected from n patients. To delete the data sampled from the i-th patient from our trained model, we would like to update it such that it becomes independent of sample i, and looks as if it had been trained on the remaining n − 1 patients. A naive approach to satisfy the requested deletion would be to retrain the model from scratch on the data from the remaining n − 1 patients. For many applications, this is not a tractable solution: the costs (in time, computation, and energy) for training many machine learning models can be quite high. Large scale algorithms can take weeks to train and consume large amounts of electricity and other resources.
Hence, we posit that efficient data deletion is a fundamental data management operation for machine learning models and AI systems, just like in relational databases or other classical data structures.

Beyond supporting individual data rights, there are various other possible use cases in which efficient data deletion is desirable. To name a few examples, it could be used to speed up leave-one-out cross-validation [2], support a user data marketplace [75, 80], or identify important or valuable datapoints within a model [37].

Deletion efficiency for general learning algorithms has not been previously studied. While the desired output of a deletion operation on a deterministic model is fairly obvious, we have yet to even define data deletion for stochastic learning algorithms. At present, there is only a handful of learning algorithms known to support fast data deletion operations, all of which are deterministic. Even so, there is no pre-existing notion of how engineers should think about the asymptotic deletion efficiency of learning systems, nor understanding of the kinds of trade-offs such systems face.

The key components of this paper include introducing deletion efficient learning, based on an intuitive and operational notion of what it means to (efficiently) delete data from a (possibly stochastic) statistical model. We pose data deletion as an online problem, from which a notion of optimal deletion efficiency emerges from a natural lower bound on amortized computation time. We do a case study on deletion efficient learning using the simple, yet perennial, k-means clustering problem. We propose two deletion efficient algorithms that (in certain regimes) achieve optimal deletion efficiency. Empirically, on six datasets, our methods achieve an average of over 100× speedup in amortized runtime with respect to the canonical Lloyd's algorithm seeded by k-means++ [53, 5].
Simultaneously, our proposed deletion efficient algorithms perform comparably to the canonical algorithm on three different statistical metrics of clustering quality. Finally, we synthesize an algorithmic toolbox for designing deletion efficient learning systems.

We summarize our work into three contributions:

(1) We formalize the problem and notion of efficient data deletion in the context of machine learning.

(2) We propose two different deletion efficient solutions for k-means clustering that have theoretical guarantees and strong empirical results.

(3) From our theory and experiments, we synthesize four general engineering principles for designing deletion efficient learning systems.

2 Related Works

Deterministic Deletion Updates  As mentioned in the introduction, efficient deletion operations are known for some canonical learning algorithms. They include linear models [55, 27, 83, 81, 18, 74], certain types of lazy learning techniques [88, 6, 11] such as non-parametric Nadaraya-Watson kernel regressions [61] or nearest-neighbors methods [22, 74], recursive support vector machines [19, 81], and co-occurrence based collaborative filtering [74].

Data Deletion and Data Privacy  Related ideas for protecting data in machine learning — e.g. cryptography [63, 16, 14, 13, 62, 31] and differential privacy [30, 21, 20, 64, 1] — do not lead to efficient data deletion, but rather attempt to make data private or non-identifiable. Algorithms that support efficient deletion do not have to be private, and algorithms that are private do not have to support efficient deletion. To see the difference between privacy and data deletion, note that every learning algorithm supports the naive data deletion operation of retraining from scratch. The algorithm is not required to satisfy any privacy guarantees.
Even an operation that outputs the entire dataset in the clear could support data deletion, whereas such an operation is certainly not private. In this sense, the challenge of data deletion only arises in the presence of computational limitations. Privacy, on the other hand, presents statistical challenges, even in the absence of any computational limitations. With that being said, data deletion has direct connections and consequences in data privacy and security, which we explore in more detail in Appendix A.

3 Problem Formulation

We proceed by describing our setting and defining the notion of data deletion in the context of a machine learning algorithm and model. Our definition formalizes the intuitive goal that after a specified datapoint, x, is deleted, the resulting model is updated to be indistinguishable from a model that was trained from scratch on the dataset sans x. Once we have defined data deletion, we define a notion of deletion efficiency in the context of an online setting. Finally, we conclude by synthesizing high-level principles for designing deletion efficient learning algorithms.

Throughout, we denote dataset D = {x1, ..., xn} as a set consisting of n datapoints, with each datapoint xi ∈ R^d; for simplicity, we often represent D as an n × d real-valued matrix as well. Let A denote a (possibly randomized) algorithm that maps a dataset to a model in hypothesis space H. We allow models to also include arbitrary metadata that is not necessarily used at inference time. Such metadata could include data structures or partial computations that can be leveraged to help with subsequent deletions. We also emphasize that algorithm A operates on datasets of any size. Since A is often stochastic, we can also treat A as implicitly defining a conditional distribution over H given dataset D.

Definition 3.1.
Data Deletion Operation: We define a data deletion operation for learning algorithm A, R_A(D, A(D), i), which maps the dataset D, model A(D), and index i ∈ {1, ..., n} to some model in H. Such an operation is a data deletion operation if, for all D and i, the random variables A(D_{−i}) and R_A(D, A(D), i) are equal in distribution: A(D_{−i}) =_d R_A(D, A(D), i).

Here we focus on exact data deletion: after deleting a training point from the model, the model should be as if this training point had never been seen in the first place. The above definition can naturally be relaxed to approximate data deletion by requiring a bound on the distance (or divergence) between the distributions of A(D_{−i}) and R_A(D, A(D), i). Refer to Appendix A for more details on approximate data deletion, especially in connection to differential privacy. We defer a full discussion of this to future work.

A Computational Challenge  Every learning algorithm, A, supports a trivial data deletion operation corresponding to simply retraining on the new dataset after the specified datapoint has been removed — namely, running algorithm A on the dataset D_{−i}. Because of this, the challenge of data deletion is computational: 1) Can we design a learning algorithm A, and supporting data structures, so as to allow for a computationally efficient data deletion operation? 2) For what algorithms A is there a data deletion operation that runs in time sublinear in the size of the dataset, or at least sublinear in the time it takes to compute the original model, A(D)? 3) How do restrictions on the memory footprint of the metadata contained in A(D) impact the efficiency of data deletion algorithms?

Data Deletion as an Online Problem  One convenient way of concretely formulating the computational challenge of data deletion is via the lens of online algorithms [17].
Given a dataset of n datapoints, a specific training algorithm A, and its corresponding deletion operation R_A, one can consider a stream of m ≤ n distinct indices, i1, i2, ..., im ∈ {1, ..., n}, corresponding to the sequence of datapoints to be deleted. The online task then is to design a data deletion operation that is given the indices {ij} one at a time, and must output A(D_{−{i1,...,ij}}) upon being given index ij. As in the extensive body of work on online algorithms, the goal is to minimize the amortized computation time. The amortized runtime in the proposed online deletion setting is a natural and meaningful way to measure deletion efficiency. A formal definition of our proposed online problem setting can be found in Appendix A.

In online data deletion, a simple lower bound on amortized runtime emerges. All (sequential) learning algorithms A run in time Ω(n) under the natural assumption that A must process each datapoint at least once. Furthermore, in the best case, A comes with a constant-time deletion operation (or a deletion oracle).

Remark 3.1. In the online setting, for n datapoints and m deletion requests, we establish an asymptotic lower bound of Ω(n/m) for the amortized computation time of any (sequential) learning algorithm.

We refer to an algorithm achieving this lower bound as deletion efficient. Obtaining tight upper and lower bounds is an open question for many basic learning paradigms including ridge regression, decision tree models, and settings where A corresponds to the solution to a stochastic optimization problem.
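As a concrete (and deliberately naive) illustration of this online setting, the sketch below implements the trivial retrain-from-scratch deletion operation that every learning algorithm supports. The class and method names are our own, not an API from this paper; the "model" is a sample mean purely for illustration.

```python
import numpy as np

class RetrainingLearner:
    """Naive baseline: the trivial deletion operation R_A that simply
    reruns A on D_{-i}. Every learning algorithm supports this, which is
    why the challenge of data deletion is purely computational.
    (Illustrative sketch; names are ours, not the paper's.)"""

    def __init__(self, train_fn):
        self.train_fn = train_fn   # the learning algorithm A
        self.data = None           # dataset D, kept as model metadata
        self.model = None          # A(D)

    def fit(self, data):
        self.data = list(data)
        self.model = self.train_fn(self.data)
        return self.model

    def delete(self, index):
        # Exact deletion by retraining from scratch on D_{-i}. Each request
        # costs at least Omega(n), so m requests cost Omega(mn) -- far from
        # the Omega(n/m) amortized lower bound that a deletion efficient
        # method targets.
        del self.data[index]
        self.model = self.train_fn(self.data)
        return self.model

# Toy algorithm A: compute a sample mean (a "model" linear in the data).
learner = RetrainingLearner(lambda d: float(np.mean(d)))
learner.fit([1.0, 2.0, 3.0, 4.0])
learner.delete(0)   # model is now as if trained on [2.0, 3.0, 4.0]
```

The deletion operation here trivially satisfies Definition 3.1 (the retrained model is exactly distributed as A(D_{−i})); what it fails to provide is any runtime advantage over retraining, which is the gap the rest of the paper addresses.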
In this paper, we do a case study on k-means clustering, showing that we can achieve deletion efficiency without sacrificing statistical performance.

3.1 General Principles for Deletion Efficient Machine Learning Systems

We identify four design principles which we envision as the pillars of deletion efficient learning algorithms.

Linearity  Use of linear computation allows for simple post-processing to undo the influence of a single datapoint on a set of parameters. Generally speaking, the Sherman-Morrison-Woodbury matrix identity and matrix factorization techniques can be used to derive fast and explicit formulas for updating linear models [55, 27, 83, 43]. For example, in the case of linear least squares regressions, QR factorization can be used to delete datapoints from learned weights in time O(d^2) [41, 90]. Linearity should be most effective in domains in which randomized [70], reservoir [89, 76], domain-specific [54], or pre-trained feature spaces elucidate linear relationships in the data.

Laziness  Lazy learning methods delay computation until inference time [88, 11, 6], resulting in trivial deletions. One of the simplest examples of lazy learning is k-nearest neighbors [32, 4, 74], where deleting a point from the dataset at deletion time directly translates to an updated model at inference time. There is a natural affinity between lazy learning and non-parametric techniques [61, 15]. Although we did not make use of laziness for unsupervised learning in this work, pre-existing literature on kernel density estimation for clustering would be a natural starting place [44]. Laziness should be most effective in regimes when there are fewer constraints on inference time and model memory than on training time or deletion time. In some sense, laziness can be interpreted as shifting computation from training to inference.
As a side effect, deletion can be immensely simplified.

Modularity  In the context of deletion efficient learning, modularity is the restriction of dependence of computation state or model parameters to specific partitions of the dataset. Under such a modularization, we can isolate specific modules of data processing that need to be recomputed in order to account for deletions to the dataset. Our notion of modularity is conceptually similar to its use in software design [10] and distributed computing [67]. In DC-k-means, we leverage modularity by managing the dependence between computation and data via the divide-and-conquer tree. Modularity should be most effective in regimes for which the dimension of the data is small compared to the dataset size, allowing for partitions of the dataset to capture the important structure and features.

Quantization  Many models come with a sense of continuity from dataset space to model space — small changes to the dataset should result in small changes to the (distribution over the) model. In statistical and computational learning theory, this idea is known as stability [60, 47, 50, 29, 77, 68]. We can leverage stability by quantizing the mapping from datasets to models (either explicitly or implicitly). Then, for a small number of deletions, such a quantized model is unlikely to change. If this can be efficiently verified at deletion time, then it can be used for fast average-case deletions. Quantization is most effective in regimes for which the number of parameters is small compared to the dataset size.

4 Deletion Efficient Clustering

Data deletion is a general challenge for machine learning. Due to its simplicity, we focus on k-means clustering as a case study. Clustering is a widely used ML application, including on the UK Biobank (for example, as in [33]). We propose two algorithms for deletion efficient k-means clustering.
In the context of k-means, we treat the output centroids as the model from which we are interested in deleting datapoints. We summarize our proposed algorithms and state theoretical runtime complexity and statistical performance guarantees. Please refer to [32] for background concerning k-means clustering.

4.1 Quantized k-Means

We propose a quantized variant of Lloyd's algorithm as a deletion efficient solution to k-means clustering, called Q-k-means. By quantizing the centroids at each iteration, we show that the algorithm's centroids are constant with respect to deletions with high probability. Under this notion of quantized stability, we can support efficient deletion, since most deletions can be resolved without re-computing the centroids from scratch. Our proposed algorithm is distinct from other quantized versions of k-means [73], which quantize the data to minimize memory or communication costs. We present an abridged version of the algorithm here (Algorithm 1). Detailed pseudo-code for Q-k-means and its deletion operation may be found in Appendix B.

Q-k-means follows the same iterative protocol as the canonical Lloyd's algorithm (and makes use of the k-means++ initialization). There are four key differences from Lloyd's algorithm. First and foremost, the centroids are quantized in each iteration before updating the partition. The quantization maps each point to the nearest vertex of a uniform ε-lattice [38]. To de-bias the quantization, we apply a random phase shift to the lattice. The particulars of the quantization scheme are discussed in Appendix B. Second, at various steps throughout the computation, we memoize the optimization state into the model's metadata for use at deletion time (incurring an additional O(ktd) memory cost).
Third, we introduce a balance correction step, which compensates for γ-imbalanced clusters by averaging current centroids with a momentum term based on the previous centroids. Explicitly, for some γ ∈ (0,1), we consider any partition πκ to be γ-imbalanced if |πκ| ≤ γn/k. We may think of γ as the ratio of the smallest cluster size to the average cluster size. Fourth, because of the quantization, the iterations are no longer guaranteed to decrease the loss, so we terminate early if the loss increases at any iteration. Note that the algorithm terminates almost surely.

Deletion in Q-k-means is straightforward. Using the metadata saved from training time, we can verify whether deleting a specific datapoint would have resulted in a different quantized centroid than was actually computed during training. If this is the case (or if the point to be deleted is one of the randomly chosen initial centroids according to k-means++), we must retrain from scratch to satisfy the deletion request. Otherwise, we may satisfy the deletion by updating our metadata to reflect the deletion of the specified datapoint, but we do not have to recompute the centroids. Q-k-means directly relies on the principle of quantization to enable fast deletion in expectation. It is also worth noting that Q-k-means leverages the principle of linearity to recycle computation.
Since centroid computation is linear in the datapoints, it is easy to determine the centroid update due to a removal at deletion time.

Algorithm 1 Quantized k-means (abridged)

Input: data matrix D ∈ R^{n×d}
Parameters: k ∈ N, T ∈ N, γ ∈ (0,1), ε > 0
c ← k-means++(D) // initialize centroids with k-means++
Save initial centroids: save(c)
L ← k-means loss of initial partition π(c)
for τ = 1 to T do
    Store current centroids: c′ ← c
    Compute centroids: c ← c(π)
    Apply correction to γ-imbalanced partitions
    Quantize to random ε-lattice: ĉ ← Q(c; θ)
    Update partition: π′ ← π(ĉ)
    Save state to metadata: save(c, θ, ĉ, |π′|)
    Compute loss L′
    if L′ < L then (c, π, L) ← (ĉ, π′, L′) else break
end for
return c // output final centroids as model

Deletion Time Complexity  We turn our attention to an asymptotic time complexity analysis of the Q-k-means deletion operation. Q-k-means supports deletion by quantizing the centroids, so they are stable against small perturbations (caused by deletion of a point).

Theorem 4.1. Let D be a dataset on [0,1]^d of size n. Fix parameters T, k, ε, and γ for Q-k-means. Then, Q-k-means supports m deletions in time O(m² d^{5/2} / ε) in expectation, with probability over the randomness in the quantization phase and k-means++ initialization.

The proof for the theorem is given in Appendix C. The intuition is as follows. Centroids are computed by taking an average. With enough terms in an average, the effect of a small number of those terms is negligible. The removal of those terms from the average can be interpreted as a small perturbation to the centroid. If that small perturbation is on a scale far below the granularity of the quantizing ε-lattice, then it is unlikely to change the quantized value of the centroid.
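To make this stability check concrete, here is a minimal sketch for a single cluster. It is our own simplification of the pseudo-code in Appendix B: the helper names are ours, the lattice phase is fixed (rather than randomly drawn) for reproducibility, and the balance correction and retrain path are omitted.

```python
import numpy as np

def quantize(c, eps, theta):
    # Map each coordinate to the nearest vertex of a uniform eps-lattice
    # with phase shift theta (the de-biasing trick from Section 4.1).
    return eps * np.round((c - theta) / eps) + theta

def delete_from_cluster(centroid, size, x, eps, theta):
    """Sketch of the Q-k-means per-cluster deletion check (our own
    simplification). Returns the downdated centroid and whether the
    quantized model is unchanged (True means no retrain is needed)."""
    # Linearity: the mean is linear in the datapoints, so removing x is a
    # closed-form downdate rather than a pass over the whole cluster.
    new_centroid = (size * centroid - x) / (size - 1)
    # Quantized stability: if the perturbed centroid snaps to the same
    # lattice vertex, the published (quantized) centroid is unchanged and
    # the deletion is satisfied by a metadata update alone.
    stable = np.allclose(quantize(centroid, eps, theta),
                         quantize(new_centroid, eps, theta))
    return new_centroid, stable

theta = np.array([0.1, 0.2])      # fixed phase, for reproducibility
c = np.array([0.40, 0.60])        # centroid of a cluster of 1000 points
x = np.array([0.45, 0.55])        # point to be deleted
c_new, stable = delete_from_cluster(c, 1000, x, eps=0.5, theta=theta)
```

With 1000 points in the cluster, the downdate perturbs the centroid by roughly 1/999 of the centroid-to-point gap, far below the lattice spacing ε, so the quantized centroid is unchanged and the deletion resolves without recomputation, matching the intuition behind Theorem 4.1.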
Thus, beyond stability verification, no additional computation is required for a majority of deletion requests. This result is in expectation with respect to the randomized initialization and randomized quantization phase, but is actually worst-case over all possible (normalized) dataset instances. The number of clusters k, iterations T, and cluster imbalance ratio γ are usually small constants in many applications, and are treated as such here. Interestingly, for constant m and ε, the expected deletion time is independent of n due to the stability probability increasing at the same rate as the problem size (see Appendix C). Deletion time for this method may not scale well in the high-dimensional setting. In the low-dimensional case, the most interesting interplay is between ε, n, and m. To obtain as high-quality statistical performance as possible, it would be ideal if ε → 0 as n → ∞. In this spirit, we can parameterize ε = n^{−β} for β ∈ (0,1). We will use this parameterization for theoretical analysis of the online setting in Section 4.3.

Theoretical Statistical Performance  We proceed to state a theoretical guarantee on the statistical performance of Q-k-means, which complements the asymptotic time complexity bound of the deletion operation. Recall that the loss for a k-means problem instance is given by the sum of squared Euclidean distances from each datapoint to its nearest centroid. Let L* be the optimal loss for a particular problem instance. Achieving the optimal solution is, in general, NP-Hard [3]. Instead, we can approximate it with k-means++, which achieves E L++ ≤ (8 log k + 16) L* [5].

Corollary 4.1.1. Let L be a random variable denoting the loss of Q-k-means on a particular problem instance of size n.
Then E L ≤ (8 log k + 16) L* + ε √(nd (8 log k + 16) L*) + (1/4) n d ε².

This corollary follows from the theoretical guarantees already known to apply to Lloyd's algorithm when initialized with k-means++, given by [5]. The proof can be found in Appendix C. We can interpret the bound by looking at the ratio of expected loss upper bounds for k-means++ and Q-k-means. If we assume our problem instance is generated by iid samples from some arbitrary non-atomic distribution, then it follows that L* = O(n). Taking the loss ratio of upper bounds yields E L / E L++ ≤ 1 + O(d ε² + √d ε). Ensuring that ε ≪ 1/√d implies the upper bound is as good as that of k-means++.

4.2 Divide-and-Conquer k-Means

We turn our attention to another variant of Lloyd's algorithm that also supports efficient deletion, albeit through quite different means. We refer to this algorithm as Divide-and-Conquer k-means (DC-k-means). At a high level, DC-k-means works by partitioning the dataset into small sub-problems, solving each sub-problem as an independent k-means instance, and recursively merging the results. We present pseudo-code for DC-k-means here, and we refer the reader to Appendix B for pseudo-code of the deletion operation.

Algorithm 2 DC-k-means

Input: data matrix D ∈ R^{n×d}
Parameters: k ∈ N, T ∈ N, tree width w ∈ N, tree height h ∈ N
Initialize a w-ary tree of height h such that each node has a pointer to a dataset and centroids
for i = 1 to n do
    Select a leaf node uniformly at random
    node.dataset.add(Di)
end for
for l = h down to 0 do
    for each node in level l do
        c ← k-means++(node.dataset, k, T)
        node.centroids ← c
        if l > 0 then
            node.parent.dataset.add(c)
        end if
    end for
end for
save all nodes as metadata
return c // model output

DC-k-means operates on a perfect w-ary tree of height h (this could be relaxed to any rooted tree).
The original dataset is partitioned across the leaves of the tree according to a uniform multinomial random variable, with datapoints as trials and leaves as outcomes. At each of these leaves, we solve for some number of centroids via k-means++. When we merge leaves into their parent node, we construct a new dataset consisting of all the centroids from each leaf. Then, we compute new centroids at the parent via another instance of k-means++. For simplicity, we keep k fixed throughout all of the sub-problems in the tree, but this could be relaxed. We make use of the tree hierarchy to modularize the computation's dependence on the data. At deletion time, we need only recompute the sub-problems from one leaf up to the root. This observation allows us to support fast deletion operations.

Our method has close similarities to pre-existing distributed k-means algorithms [69, 67, 9, 7, 39, 8, 92], but is in fact distinct (not only in that it is modified for deletion, but also in that it operates over general rooted trees). For simplicity, we restrict our discussion to only the simplest of divide-and-conquer trees. We focus on depth-1 trees with w leaves where each leaf solves for k centroids. This requires only one merge step with a root problem size of kn/w.

Analogous to how ε serves as a knob to trade off between deletion efficiency and statistical performance in Q-k-means, for DC-k-means, we imagine that w might also serve as a similar knob.
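A minimal depth-1 version of this scheme can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: the class and helper names are ours, plain Lloyd's with random-sample initialization stands in for the k-means++ subroutine, and deletion addresses a point by its (leaf, position) pair rather than a global dataset index.

```python
import numpy as np

def lloyd(data, k, T, seed=0):
    # Plain Lloyd's with random-sample init (a stand-in for the k-means++
    # seeding used in the paper).
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(T):
        labels = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids

class DepthOneDCKMeans:
    """Sketch of depth-1 DC-k-means: split the data into w leaves, solve
    each leaf, then cluster the k*w leaf centroids at the root. Deleting a
    point recomputes only its own leaf and the root, never the other
    w-1 leaves (the modularity principle)."""

    def __init__(self, k, w, T=10, seed=0):
        self.k, self.w, self.T = k, w, T
        self.rng = np.random.default_rng(seed)

    def fit(self, data):
        # Uniform multinomial assignment of points to leaves.
        self.leaves = [[] for _ in range(self.w)]
        for x in data:
            self.leaves[self.rng.integers(self.w)].append(x)
        self.leaf_centroids = [lloyd(np.array(leaf), self.k, self.T)
                               for leaf in self.leaves]
        return self._merge()

    def _merge(self):
        # Root problem: k-means over the k*w leaf centroids.
        root_data = np.vstack(self.leaf_centroids)
        self.centroids = lloyd(root_data, self.k, self.T)
        return self.centroids

    def delete(self, leaf_idx, point_idx):
        # Recompute only the affected leaf, then re-merge at the root.
        del self.leaves[leaf_idx][point_idx]
        self.leaf_centroids[leaf_idx] = lloyd(
            np.array(self.leaves[leaf_idx]), self.k, self.T)
        return self._merge()

rng = np.random.default_rng(1)
data = rng.normal(size=(400, 2))
model = DepthOneDCKMeans(k=3, w=4)
model.fit(data)
new_centroids = model.delete(leaf_idx=0, point_idx=0)
```

A deletion here touches one leaf of roughly n/w points plus a root problem of kw points, which is the source of the O(max{n^ρ, n^{1−ρ}}) per-deletion cost in Proposition 4.2 when w = Θ(n^ρ).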
For example, if w = 1, DC-k-means degenerates into canonical Lloyd's (as does Q-k-means as ε → 0). The dependence of statistical performance on tree width w is less theoretically tractable than that of Q-k-means on ε, but in Appendix D, we empirically show that statistical performance tends to decrease as w increases, which is perhaps somewhat expected.

As we show in our experiments, depth-1 DC-k-means demonstrates an empirically compelling trade-off between deletion time and statistical performance. There are various other potential extensions of this algorithm, such as weighting centroids based on cluster mass as they propagate up the tree or exploring the statistical performance of deeper trees.

Deletion Time Complexity  For the ensuing asymptotic analysis, we may consider parameterizing tree width w as w = Θ(n^ρ) for ρ ∈ (0,1). As before, we treat k and T as small constants. Although intuitive, there are some technical minutiae to account for to prove correctness and runtime for the DC-k-means deletion operation. The proof of Proposition 4.2 may be found in Appendix C.

Proposition 4.2. Let D be a dataset on R^d of size n. Fix parameters T and k for DC-k-means. Let w = Θ(n^ρ) and ρ ∈ (0,1). Then, with a depth-1, w-ary divide-and-conquer tree, DC-k-means supports m deletions in time O(m max{n^ρ, n^{1−ρ}} d) in expectation with probability over the randomness in dataset partitioning.

4.3 Amortized Runtime Complexity in Online Deletion Setting

We state the amortized computation time for both of our algorithms in the online deletion setting defined in Section 3. We are in an asymptotic regime where the number of deletions m = Θ(n^α) for 0 < α < 1 (see Appendix C for more details). Recall the Ω(n/m) lower bound from Section 3.
For a particular fractional power α, an algorithm achieving the optimal asymptotic lower bound on amortized computation is said to be α-deletion efficient. This corresponds to achieving an amortized runtime of O(n^{1−α}). The following corollaries result from direct calculations, which may be found in Appendix C. Note that Corollary 4.2.2 assumes DC-k-means is trained sequentially.

Corollary 4.2.1. With ε = Θ(n^{−β}), for 0 < β < 1, the Q-k-means algorithm is α-deletion efficient in expectation if α ≤ (1 − β)/2.

Corollary 4.2.2. With w = Θ(n^ρ), for 0 < ρ < 1, and a depth-1 w-ary divide-and-conquer tree, DC-k-means is α-deletion efficient in expectation if α < 1 − max{1 − ρ, ρ}.

5 Experiments

With a theoretical understanding in hand, we seek to empirically characterize the trade-off between runtime and performance for the proposed algorithms. In this section, we provide proof-of-concept for our algorithms by benchmarking their amortized runtimes and clustering quality on a simulated stream of online deletion requests. As a baseline, we use the canonical Lloyd's algorithm initialized by k-means++ seeding [53, 5]. Following the broader literature, we refer to this baseline simply as k-means, and refer to our two proposed methods as Q-k-means and DC-k-means.

Datasets  We run our experiments on five real, publicly available datasets: Celltype (N = 12,009, D = 10, K = 4) [42], Covtype (N = 15,120, D = 52, K = 7) [12], MNIST (N = 60,000, D = 784, K = 10) [51], Postures (N = 74,975, D = 15, K = 5) [35, 34], and Botnet (N = 1,018,298, D = 115, K = 11) [56], as well as a synthetic dataset made from a Gaussian mixture model which we call Gaussian (N = 100,000, D = 25, K = 5). We refer the reader to Appendix D for more details on the datasets. All datasets come with ground-truth labels as well.
Although we do not make use of the labels at learning time, we can use them to evaluate the statistical quality of the clustering methods.

Online Deletion Benchmark We simulate a stream of 1,000 deletion requests, selected uniformly at random and without replacement. An algorithm trains once, on the full dataset, and then runs its deletion operation to satisfy each request in the stream, producing an intermediate model at each request. For the canonical k-means baseline, deletions are satisfied by retraining from scratch.

Protocol To measure statistical performance, we evaluate with three metrics (see Section 5.1) that measure cluster quality. To measure deletion efficiency, we measure the wall-clock time to complete our online deletion benchmark. For both of our proposed algorithms, we always fix 10 iterations of Lloyd's, and all other parameters are selected with simple but effective heuristics (see Appendix D). This alleviates the need to tune them. To set a fair k-means baseline, when reporting runtime on the online deletion benchmark we also fix 10 iterations of Lloyd's, but when reporting statistical performance metrics we run until convergence. We run five replicates for each method on each dataset and include standard deviations with all our results. We refer the reader to Appendix D for more experimental details.

5.1 Statistical Performance Metrics

To evaluate the clustering performance of our algorithms, the most obvious metric is the optimization loss of the k-means objective. Recall that this is the sum of squared Euclidean distances from each datapoint to its nearest centroid.
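For reference, this objective can be computed in a few lines. The following is a minimal numpy sketch (illustrative, not the paper's code), assuming X is an (n, d) array of points and centers is a (k, d) array of centroids:

```python
import numpy as np

def kmeans_loss(X, centers):
    """k-means objective: sum of squared Euclidean distances from each
    point in X to its nearest centroid."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (n, k) squared distances
    return float(d2.min(axis=1).sum())
```

The loss ratios reported later (Table 1) then correspond to kmeans_loss(X, method_centers) / kmeans_loss(X, baseline_centers).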
To thoroughly validate the statistical performance of our proposed algorithms, we additionally include two canonical clustering performance metrics.

Silhouette Coefficient [72]: This coefficient measures a type of correlation (between -1 and +1) that captures how dense each cluster is and how well-separated different clusters are. The silhouette coefficient is computed without ground-truth labels, using only spatial information. Higher scores indicate denser, more well-separated clusters.

Normalized Mutual Information (NMI) [87, 49]: This quantity measures the agreement of the assigned clusters with the ground-truth labels, up to permutation. NMI is upper bounded by 1, which is achieved by perfect assignments. Higher scores indicate better agreement between clusters and ground-truth labels.

5.2 Summary of Results

We summarize our key findings in four tables. In Tables 1-3, we report the statistical clustering performance of the 3 algorithms on each of the 6 datasets. In Table 1, we report the optimization loss ratios of our proposed methods over the k-means++ baseline. In Table 2, we report the silhouette coefficients for the clusters. In Table 3, we report the NMI. In Table 4, we report the amortized total runtime of training and deletion for each method. Overall, we see that the statistical clustering performance of the three methods is competitive.

Furthermore, we find that both proposed algorithms yield orders of magnitude of speedup.
As expected from the theoretical analysis, Q-k-means offers greater speed-ups when the dimension is lower relative to the sample size, whereas DC-k-means is more consistent across dimensionalities.

Table 1: Loss Ratio

Dataset    k-means        Q-k-means      DC-k-means
Celltype   1.0±0.0        1.158±0.099    1.439±0.157
Covtype    1.0±0.029      1.033±0.017    1.017±0.031
MNIST      1.0±0.002      1.11±0.004     1.014±0.003
Postures   1.0±0.004      1.014±0.015    1.034±0.017
Gaussian   1.0±0.014      1.019±0.019    1.003±0.014
Botnet     1.0±0.126      1.018±0.014    1.118±0.102

Table 2: Silhouette Coefficients (higher is better)

Dataset    k-means        Q-k-means      DC-k-means
Celltype   0.384±0.001    0.367±0.048    0.422±0.057
Covtype    0.238±0.027    0.203±0.026    0.222±0.017
MNIST      0.036±0.002    0.031±0.002    0.035±0.001
Postures   0.107±0.003    0.107±0.004    0.109±0.005
Gaussian   0.066±0.007    0.053±0.003    0.071±0.004
Botnet     0.583±0.042    0.639±0.028    0.627±0.046

Table 3: Normalized Mutual Information (higher is better)

Dataset    k-means        Q-k-means      DC-k-means
Celltype   0.36±0.0       0.336±0.032    0.294±0.067
Covtype    0.311±0.009    0.332±0.024    0.335±0.02
MNIST      0.494±0.006    0.459±0.011    0.494±0.004
Gaussian   0.319±0.024    0.245±0.024    0.318±0.024
Postures   0.163±0.018    0.169±0.012    0.173±0.011
Botnet     0.708±0.048    0.73±0.015     0.705±0.039

Table 4: Amortized Runtime in Online Deletion Benchmark (Train once + 1,000 Deletions)

Dataset    k-means Runtime (s)   Q-k-means Runtime (s)   Speedup     DC-k-means Runtime (s)   Speedup
Celltype   4.241±0.248           0.026±0.011             163.286×    0.272±0.007              15.6×
Covtype    6.114±0.216           0.454±0.276             13.464×     0.469±0.021              13.048×
MNIST      65.038±1.528          29.386±0.728            2.213×      2.562±0.056              25.381×
Postures   26.616±1.222          0.413±0.305             64.441×     1.17±0.398               22.757×
Gaussian   206.631±67.285        0.393±0.104             525.63×     5.992±0.269              34.483×
Botnet     607.784±64.687        1.04±0.368              584.416×    8.568±0.652              70.939×

Figure 1: Online deletion efficiency: # of deletions vs. amortized runtime (secs) for 3 algorithms on 6 datasets.

In particular, note that MNIST has the highest d/n ratio of the datasets we tried, followed by Covtype. These two datasets are, respectively, the datasets for which Q-k-means offers the least speedup. On the other hand, DC-k-means offers consistently increasing speedup as n increases, for fixed d. Furthermore, we see that Q-k-means tends to have higher variance in its deletion efficiency, due to the randomness in centroid stabilization having a larger impact than the randomness in the dataset partitioning. We remark that 1,000 deletions is less than 10% of every dataset we test on, and statistical performance remains virtually unchanged throughout the benchmark. In Figure 1, we plot the amortized runtime on the online deletion benchmark as a function of the number of deletions in the stream. We refer the reader to Appendix D for supplementary experiments providing more detail on our methods.

6 Discussion

At present, the main options for deletion efficient supervised methods are linear models, support vector machines, and non-parametric regressions. While our analysis here focuses on the concrete problem of clustering, we have proposed four design principles which we envision as the pillars of deletion efficient learning algorithms. We discuss the potential application of these methods to other supervised
We discuss the potential application of these methods to other supervised\nlearning techniques.\n\nSegmented Regression Segmented (or piece-wise) linear regression is a common relaxation of\ncanonical regression models [58, 59, 57]. It should be possible to support a variant of segmented\nregression by combining Q-k-means with linear least squares regression. Each cluster could be given\na separate linear model, trained only on the datapoints in said cluster. At deletion time, Q-k-means\nwould likely keep the clusters stable, enabling a simple linear update to the model corresponding to the\ncluster from which the deleted point belonged.\n\nKernel Regression Kernel regressions in the style of random Fourier features [70] could be readily\namended to support ef\ufb01cient deletions for large-scale supervised learning. Random features do not\ndepend on data, and thus only the linear layer over the feature space requires updating for deletion.\nFurthermore, random Fourier feature methods have been shown to have af\ufb01nity for quantization [91].\n\nDecision Trees and Random Forests Quantization is also a promising approach for decision trees.\nBy quantizing or randomizing decision tree splitting criteria (such as in [36]) it seems possible to\nsupport ef\ufb01cient deletion. Furthermore, random forests have a natural af\ufb01nity with bagging, which\nnaturally can be used to impose modularity.\n\nDeep Neural Networks and Stochastic Gradient Descent A line of research has observed the\nrobustness of neural network training robustness to quantization and pruning [84, 46, 40, 71, 25, 52].\nIt could be possible to leverage these techniques to quantize gradient updates during SGD-style\noptimization, enabling a notion of parameter stability analgous to that in Q-k-means. This would\nrequire larger batch sizes and fewer gradient steps in order to scale well. 
It is also possible that approximate deletion methods may be able to overcome shortcomings of exact deletion methods for large neural models.

7 Conclusion

In this work, we developed a notion of deletion efficiency for large-scale learning systems, proposed provably deletion efficient unsupervised clustering algorithms, and identified potential algorithmic principles that may enable deletion efficiency for other learning algorithms and paradigms. We have only scratched the surface of understanding deletion efficiency in learning systems. Throughout, we made a number of simplifying assumptions, such as that there is only one model and only one database in our system. We also assumed that user-based deletion requests correspond to only a single data point. Understanding deletion efficiency in a system with many models and many databases, as well as complex user-to-data relationships, is an important direction for future work.

Acknowledgments: This research was partially supported by NSF Awards AF:1813049, CCF:1704417, and CCF:1763191, NIH R21 MD012867-01, NIH P30AG059307, an Office of Naval Research Young Investigator Award (N00014-18-1-2295), a seed grant from Stanford's Institute for Human-Centered AI, and the Chan-Zuckerberg Initiative. We would also like to thank I. Lemhadri, B. He, V. Bagaria, J. Thomas, and anonymous reviewers for helpful discussion and feedback.

References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[2] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin. Learning from Data, volume 4. AMLBook, New York, NY, USA, 2012.

[3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering.
Machine Learning, 75(2):245–248, 2009.

[4] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.

[5] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[6] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning for control. In Lazy Learning, pages 75–113. Springer, 1997.

[7] O. Bachem, M. Lucic, and A. Krause. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 292–300. JMLR.org, 2017.

[8] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.

[9] M.-F. F. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general topologies. In Advances in Neural Information Processing Systems, pages 1995–2003, 2013.

[10] O. Berman and N. Ashrafi. Optimization models for reliability of modular software systems. IEEE Transactions on Software Engineering, 19(11):1119–1123, 1993.

[11] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least squares algorithm. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 375–381, Cambridge, MA, USA, 1999. MIT Press.

[12] J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 1999.

[13] D. Bogdanov, L. Kamm, S. Laur, and V. Sokk.
Implementation and evaluation of an algorithm for cryptographically private principal component analysis on genomic data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 15(5):1427–1432, 2018.

[14] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191. ACM, 2017.

[15] G. Bontempi, H. Bersini, and M. Birattari. The local paradigm for modeling and control: from neuro-fuzzy to lazy learning. Fuzzy Sets and Systems, 121(1):59–72, 2001.

[16] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser. Machine learning classification over encrypted data. In NDSS, 2015.

[17] L. Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

[18] Y. Cao and J. Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pages 463–480. IEEE, 2015.

[19] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, pages 409–415, 2001.

[20] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

[21] K. Chaudhuri, A. D. Sarwate, and K. Sinha. A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research, 14(1):2905–2943, 2013.

[22] D. Coomans and D. L. Massart. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules. Analytica Chimica Acta, 136:15–27, 1982.

[23] Council of European Union.
Council Regulation (EU) No 2012/0011, 2014. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52012PC0011.

[24] Council of European Union. Council Regulation (EU) No 2016/679, 2014. https://eur-lex.europa.eu/eli/reg/2016/679/oj.

[25] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.

[26] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[27] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, Inc., New York, NY, USA, 1980.

[28] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.

[29] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979.

[30] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[31] Z. Erkin, T. Veugen, T. Toft, and R. L. Lagendijk. Generating private recommendations efficiently using homomorphic encryption and data packing. IEEE Transactions on Information Forensics and Security, 7(3):1053–1066, 2012.

[32] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Number 10. Springer Series in Statistics, New York, 2001.

[33] K. J. Galinsky, P.-R. Loh, S. Mallick, N. J. Patterson, and A. L. Price. Population structure of UK Biobank and ancient Eurasians reveals adaptation at genes influencing blood pressure. The American Journal of Human Genetics, 99(5):1130–1139, 2016.

[34] A. Gardner, C. A. Duncan, J. Kanno, and R. Selmic. 3D hand posture recognition from small unlabeled point sets.
In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 164–169. IEEE, 2014.

[35] A. Gardner, J. Kanno, C. A. Duncan, and R. Selmic. Measuring distance between unordered sets of different sizes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 137–143, 2014.

[36] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[37] A. Ghorbani and J. Zou. Data Shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019.

[38] R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.

[39] S. Guha, R. Rastogi, and K. Shim. CURE: an efficient clustering algorithm for large databases. In ACM SIGMOD Record, pages 73–84. ACM, 1998.

[40] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi. Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2018.

[41] S. Hammarling and C. Lucas. Updating the QR factorization and the least squares problem. Tech. Report, The University of Manchester, 2008.

[42] X. Han, R. Wang, Y. Zhou, L. Fei, H. Sun, S. Lai, A. Saadatpour, Z. Zhou, H. Chen, F. Ye, et al. Mapping the mouse cell atlas by Microwell-seq. Cell, 172(5):1091–1107, 2018.

[43] N. J. Higham. Accuracy and Stability of Numerical Algorithms, volume 80. SIAM, 2002.

[44] A. Hinneburg and H.-H. Gabriel. DENCLUE 2.0: Fast clustering based on kernel density estimation. In International Symposium on Intelligent Data Analysis, pages 70–80. Springer, 2007.

[45] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189–206):1, 1984.

[46] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R.
Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1–12. IEEE, 2017.

[47] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.

[48] A. Knoblauch. Closed-form expressions for the moments of the binomial probability distribution. SIAM Journal on Applied Mathematics, 69(1):197–204, 2008.

[49] Z. F. Knops, J. A. Maintz, M. A. Viergever, and J. P. Pluim. Normalized mutual information based registration using k-means clustering and shading correction. Medical Image Analysis, 10(3):432–439, 2006.

[50] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. Technical report, TR-2002-03, University of Chicago, Computer Science Department, 2002.

[51] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[52] D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.

[53] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[54] D. G. Lowe et al. Object recognition from local scale-invariant features. In ICCV, number 2, pages 1150–1157, 1999.

[55] J. H. Maindonald. Statistical Computation. John Wiley & Sons, Inc., New York, NY, USA, 1984.

[56] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici. N-BaIoT: network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Computing, 17(3):12–22, 2018.

[57] V. M. Muggeo.
Estimating regression models with unknown break-points. Statistics in Medicine, 22(19):3055–3071, 2003.

[58] V. M. Muggeo. Testing with a nuisance parameter present only under the alternative: a score-based approach with application to segmented modelling. Journal of Statistical Computation and Simulation, 86(15):3059–3067, 2016.

[59] V. M. Muggeo et al. segmented: an R package to fit regression models with broken-line relationships. R News, 8(1):20–25, 2008.

[60] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1–3):161–193, 2006.

[61] E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

[62] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In Security and Privacy (SP), 2013 IEEE Symposium on, pages 334–348. IEEE, 2013.

[63] O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, and M. Costa. Oblivious multi-party machine learning on trusted processors. In USENIX Security Symposium, pages 619–636, 2016.

[64] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.

[65] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[66] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M.
Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[67] D. Peleg. Distributed Computing. SIAM Monographs on Discrete Mathematics and Applications, 5:1–1, 2000.

[68] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419, 2004.

[69] J. Qin, W. Fu, H. Gao, and W. X. Zheng. Distributed k-means algorithm and fuzzy c-means algorithm for sensor networks based on multiagent consensus theory. IEEE Transactions on Cybernetics, 47(3):772–783, 2016.

[70] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[71] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[72] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

[73] V. Schellekens and L. Jacques. Quantized compressive k-means. IEEE Signal Processing Letters, 25(8):1211–1215, 2018.

[74] S. Schelter. "Amnesia" – towards machine learning models that can forget user data very fast. In 1st International Workshop on Applied AI for Database Systems and Applications (AIDB'19), 2019.

[75] F. Schomm, F. Stahl, and G. Vossen. Marketplaces for data: an initial survey. ACM SIGMOD Record, 42(1):15–26, 2013.

[76] B. Schrauwen, D. Verstraeten, and J. Van Campenhout. An overview of reservoir computing: theory, applications and implementations. In Proceedings of the 15th European Symposium on Artificial Neural Networks, pages 471–482, 2007.

[77] S. Shalev-Shwartz, O.
Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.

[78] C. E. Shannon. Communication theory of secrecy systems. Bell System Technical Journal, 28(4):656–715, 1949.

[79] C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey, P. Elliott, J. Green, M. Landray, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.

[80] H.-L. Truong, M. Comerio, F. De Paoli, G. Gangadharan, and S. Dustdar. Data contracts for cloud-based data marketplaces. International Journal of Computational Science and Engineering, 7(4):280–295, 2012.

[81] C.-H. Tsai, C.-Y. Lin, and C.-J. Lin. Incremental and decremental training for linear classification. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 343–352. ACM, 2014.

[82] S. Van Der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.

[83] C. F. Van Loan and G. H. Golub. Matrix Computations. Johns Hopkins University Press, 1983.

[84] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on CPUs. Citeseer.

[85] M. Veale, R. Binns, and L. Edwards. Algorithms that remember: model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2133):20180083, 2018.

[86] E. F. Villaronga, P. Kieseberg, and T. Li. Humans forget, machines remember: Artificial intelligence and the right to be forgotten. Computer Law & Security Review, 34(2):304–313, 2018.

[87] N. X. Vinh, J. Epps, and J. Bailey.
Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010.

[88] G. I. Webb. Lazy Learning, pages 571–572. Springer US, 2010.

[89] J. Yin and Y. Meng. Self-organizing reservoir computing with dynamically regulated cortical neural networks. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2012.

[90] S. Zeb and M. Yousaf. Updating QR factorization procedure for solution of linear least squares problem with equality constraints. Journal of Inequalities and Applications, 2017(1):281, 2017.

[91] J. Zhang, A. May, T. Dao, and C. Ré. Low-precision random Fourier features for memory-constrained kernel approximation. arXiv preprint arXiv:1811.00155, 2018.

[92] W. Zhao, H. Ma, and Q. He. Parallel k-means clustering based on MapReduce. In IEEE International Conference on Cloud Computing, pages 674–679. Springer, 2009.