{"title": "Near-Optimal Entrywise Sampling for Data Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 1565, "page_last": 1573, "abstract": "We consider the problem of independently sampling $s$ non-zero entries of a matrix $A$ in order to produce a sparse sketch of it, $B$, that minimizes $\\|A-B\\|_2$. For large $m \\times n$ matrices, such that $n \\gg m$ (for example, representing $n$ observations over $m$ attributes) we give distributions exhibiting four important properties. First, they have closed forms for the probability of sampling each item which are computable from minimal information regarding $A$. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with $O(1)$ computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.", "full_text": "Near-Optimal Entrywise Sampling for Data Matrices\n\nDimitris Achlioptas\n\nUC Santa Cruz\n\noptas@cs.ucsc.edu\n\nZohar Karnin\nYahoo Labs\n\nzkarnin@ymail.com\n\nEdo Liberty\nYahoo Labs\n\nedo.liberty@ymail.com\n\nAbstract\n\nWe consider the problem of selecting non-zero entries of a matrix A in order to\nproduce a sparse sketch of it, B, that minimizes A B 2. For large m n matri-\nces, such that n m (for example, representing n observations over m attributes)\nwe give sampling distributions that exhibit four important properties. First, they\nhave closed forms computable from minimal information regarding A. 
Second,\nthey allow sketching of matrices whose non-zeros are presented to the algorithm\nin arbitrary order as a stream, with O 1 computation per non-zero. Third, the\nresulting sketch matrices are not only sparse, but their non-zero entries are highly\ncompressible. Lastly, and most importantly, under mild assumptions, our distri-\nbutions are provably competitive with the optimal of\ufb02ine distribution. Note that\nthe probabilities in the optimal of\ufb02ine distribution may be complex functions of\nall the entries in the matrix. Therefore, regardless of computational complexity,\nthe optimal distribution might be impossible to compute in the streaming model.\n\n1\n\nIntroduction\n\nGiven an m n matrix A, it is often desirable to \ufb01nd a sparser matrix B that is a good proxy\nfor A. Besides being a natural mathematical question, such sparsi\ufb01cation has become a ubiqui-\ntous preprocessing step in a number of data analysis operations including approximate eigenvector\ncomputations [AM01, AHK06, AM07], semi-de\ufb01nite programming [AHK05, d\u2019A08], and matrix\ncompletion problems [CR09, CT10].\nA fruitful measure for the approximation of A by B is the spectral norm of A B, where for any\nmatrix C its spectral norm is de\ufb01ned as C 2 max x 2 1 Cx 2. Randomization has been central\nin the context of matrix approximations and the overall problem is typically cast as follows: given a\nmatrix A and a budget s, devise a distribution over matrices B such that the (expected) number of\nnon-zero entries in B is at most s and A B 2 is as small as possible.\nOur work is motivated by big data matrices that are generated by measurement processes. Each\nof the n matrix columns correspond to an observation of m attributes. Thus, we expect n m.\nAlso we expect the total number of non-zero entries in A to exceed available memory. 
We assume\nthat the original data matrix A is accessed in the streaming model where we know only very basic\nfeatures of A a priori and the actual non-zero entries are presented to us one at a time in an arbitrary\norder. The streaming model is especially important for tasks like recommendation engines where\nuser-item preferences become available one by one in an arbitrary order. But, it is also important in\ncases when A exists in durable storage and random access of its entries is prohibitively expensive.\nWe establish that for such matrices the following approach gives provably near-optimal sparsi\ufb01ca-\ntion. Assign to each element Aij of the matrix a weight that depends only on the elements in its\nrow qij\nAij A i 1. Take \u21e2 to be an (appropriate) distribution over the rows. Sample s i.i.d.\nlocations from A using the distribution pij\n\u21e2iqij. Return B which is the mean of s matrices, each\ncontaining a single non zero entry Aij pij in the corresponding selected location i, j .\nAs we will see, this simple form of the probabilities pij falls out naturally from generic optimization\nconsiderations. The fact that each entry is kept with probability proportional to its magnitude, be-\n\n1\n\n\fkij\n\ns log n s\n\n0, 1 . The result is a matrix B which is representable in O m log n\n\nsides being interesting on its own right, has a remarkably practical implication. Every non-zero in the\ni-th row of B will take the form kij A i 1 s\u21e2i where kij is the number of times location i, j of\nA was selected. Note that since we sample with replacement kij may be more than 1 but, typically,\nbits.\nkij\nThis is because there is no reason to store \ufb02oating point matrix entry values. We use O m log n\nbits to store1 all values A i 1 s\u21e2i and O s log n s\nbits to store the non zero index offsets. Note\nthat\ns and that some of the offsets may be zero. 
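The sampling scheme just described — pick row i with probability ρ_i, pick entry j within the row with probability |A_ij|/‖A_i‖₁, and average s single-entry unbiased estimates A_ij/p_ij — can be sketched in a few lines. This is a minimal non-streaming sketch for intuition only: the row distribution ρ is taken as a given input (Algorithm 1 below computes the near-optimal one), and rows are assumed to have nonzero L1 norm.

```python
import numpy as np

def sparse_sketch(A, rho, s, seed=0):
    """Non-streaming sketch: draw s i.i.d. locations with probability
    p_ij = rho_i * |A_ij| / ||A_i||_1 and average the s single-entry
    matrices holding A_ij / p_ij, an unbiased estimator of A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_l1 = np.abs(A).sum(axis=1)           # ||A_i||_1 (assumed nonzero)
    p = (rho / row_l1)[:, None] * np.abs(A)  # entrywise probabilities, sum to 1
    idx = rng.choice(m * n, size=s, p=p.ravel())  # sampling with replacement
    B = np.zeros_like(A, dtype=float)
    for t in idx:
        i, j = divmod(int(t), n)
        B[i, j] += A[i, j] / (s * p[i, j])   # mean of the s single-entry matrices
    return B
```

Since each selected location contributes A_ij/(s·p_ij), repeated selections of the same location simply accumulate, which is exactly the k_ij·‖A_i‖₁/(s·ρ_i) form discussed above.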
In a simple experiment we measured\nthe average number of bits per sample resulting from this approach (total size of the sketch divided\nby the number of samples s). The results were between 5 and 22 bits per sample depending on the\nmatrix and s. It is important to note that the number of bits per sample was usually less than even\nlog2 m , the minimal number of bits required to represent a pair i, j . Our experiments\nlog2 n\nshow a reduction of disc space by a factor of between 2 and 5 relative to the compressed size of the\n\ufb01le representing the sample matrix B in the standard row-column-value list format.\nAnother insight of our work is that the distributions we propose are combinations of two L1-based\ndistributions and and which distribution is more dominant depends on the sampling budget. When\nthe number of samples s is small, \u21e2i is nearly linear in A i 1 resulting in pij Aij . However, as\nthe number of samples grows, \u21e2i tends towards A i\n1 resulting in pij Aij A i 1, a distribution\n2\nwe refer to as Row-L1 sampling. The dependence of the preferred distribution on the sample budget\nis also borne out in experiments, with sampling based on appropriately mixed distributions being\nconsistently best. This highlights that the need to adapt the sampling distribution to the sample\nbudget is a genuine phenomenon.\n\n2 Measure of Error and Related Work\n\nWe measure the difference between A and B with respect to the L2 (spectral) norm as it is highly\nrevealing in the context of data analysis. Let us de\ufb01ne a linear trend in the data of A as any tendency\nof the rows to align with a particular unit vector x. To examine the presence of such a trend, we need\nonly multiply A with x: the ith coordinate of Ax is the projection of the ith row of A onto x. Thus,\nAx 2 measures the strength of linear trend x in A, and A 2 measures the strongest linear trend in\nA. 
Thus, minimizing A B 2 minimizes the strength of the strongest linear trend of A not captured\nby B. In contrast, measuring the difference using an entry-wise norm, e.g., the Frobenius norm, can\nbe completely uninformative. This is because the best strategy would be to always pick the largest\ns matrix entries from A, a strategy that can easily be \u201cfooled\u201d. As a stark example, when the matrix\nentries are Aij\n0, 1 , the quality of approximation of A by B is completely independent of which\nelements of A we keep. This is clearly bad; as long as A contains even a modicum of structure\ncertain approximations will be far better than others.\nBy using the spectral norm to measure error we get a natural and sophisticated target: to minimize\nA B 2 is to make E A B a near-rotation, having only small variations in the amount by which\nit stretches different vectors. This idea that the error matrix E should be isotropic, thus packing as\nmuch Frobenius norm as possible for its L2 norm, motivated the \ufb01rst work on element-wise matrix\nsampling by Achlioptas and McSherry [AM07]. Concretely, to minimize E 2 it is natural to aim\nfor a matrix E that is both zero-mean, i.e., an unbiased estimator of A, and whose entries are formed\nby sampling the entries of A (and, thus, of E) independently. In the work of [AM07], E is a matrix\nof i.i.d. zero-mean random variables. The study of the spectral characteristics of such matrices\ngoes back all the way to Wigner\u2019s famous semi-circle law [Wig58]. Speci\ufb01cally, to bound E 2\nin [AM07] a bound due to Alon Krivelevich and Vu [AKV02] was used, a re\ufb01nement of a bound\nby Juh\u00b4asz [Juh81] and F\u00a8uredi and Koml\u00b4os [FK81]. The most salient feature of that bound is that it\ndepends on the maximum entry-wise variance 2 of A B, and therefore the distribution optimizing\nthe bound is the one in which the variance of all entries in E is the same. 
In turn, this means keeping\neach entry of A independently with probability pij A2\nSeveral papers have since analyzed L2-sampling and variants [NDT09, NDT10, DZ11, GT09,\nAM07]. An inherent dif\ufb01culty of L2-sampling based strategies is the need for special handling\nij, the result-\nof small entries. This is because when each item Aij is kept with probability pij A2\n\nij (up to a small wrinkle discussed below).\n\n1It is harmless to assume any value in the matrix is kept using O log n bits of precision. Otherwise,\n\ntruncating the trailing bits can be shown to be negligible.\n\n2\n\n\fing entry Bij in the sample matrix has magnitude Aij pij Aij\n1. Thus, if an extremely small\nelement Aij is accidentally picked, the largest entry of the sample matrix \u201cblows up\u201d. In [AM07]\nthis was addressed by sampling small entries with probability proportional to Aij rather than A2\nij.\nIn the work of Gittens and Tropp [GT09], small entries are not handled separately and the bound\nderived depends on the ratio between the largest and the smallest non-zero magnitude.\nRandom matrix theory has witnessed dramatic progress in the last few years and [AW02, RV07,\nTro12a, Rec11] provide a good overview of the results. This progress motivated Drineas and Zouzias\nin [DZ11] to revisit L2-sampling using concentration results for sums of random matrices [Rec11],\nas we do here. This is somewhat different from the original setting of [AM07] since now B is not\na random matrix with independent entries, but a sum of many single-element independent matrices,\neach such matrix resulting by choosing a location of A with replacement. Their work improved\nupon all previous L2-based sampling results and also upon the L1-sampling result of Arora, Hazan\nand Kale [AHK06], discussed below, while admitting a remarkably compact proof. 
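The "blow-up" of small entries under plain L2-sampling can be seen numerically: with p_ij ∝ A_ij², a sampled entry is rescaled to A_ij/p_ij = ‖A‖_F²/A_ij, which grows as the entry shrinks. A tiny illustration (the matrix values are made up):

```python
import numpy as np

# With p_ij = A_ij^2 / ||A||_F^2, the sketch entry for a sampled location is
# A_ij / p_ij = ||A||_F^2 / A_ij: the smaller |A_ij|, the larger the entry.
A = np.array([[10.0, 10.0],
              [10.0, 1e-6]])
p = A**2 / (A**2).sum()
rescaled = A / p      # magnitude a sampled entry would take in the sketch
# the three large entries map to ~30, the tiny entry to ~3e8
```

This is why L2-based schemes must treat small entries specially, as discussed next.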
The issue of\nsmall entries was handled in [DZ11] by deterministically discarding all suf\ufb01ciently small entries, a\nstrategy that gives a strong mathematical guarantee (but see the discussion regarding deterministic\ntruncation in the experimental section).\nA completely different tack at the problem, avoiding random matrix theory altogether, was taken\nby Arora et al. [AHK06]. Their approximation keeps the largest entries in A deterministically\n(speci\ufb01cally all Aij\nn where the threshold \" needs be known a priori) and randomly rounds\nthe remaining smaller entries to sign Aij \"\n1 xT A B y by noting that, as a scalar quantity, its concentration around its expec-\nsup x\ntation can be established by standard Bernstein-Bennet type inequalities. A union bound then allows\nthem to prove that with high probability, xT A B y\n\" for every x and y. The result of [AHK06]\nadmits a relatively simple proof. However, it also requires a truncation that depends on the desired\napproximation \". Rather interestingly, this time the truncation amounts to keeping every entry larger\nthan some threshold.\n\nn or 0. They exploit the simple fact A B\n\n1, y\n\n\"\n\n3 Our Approach\n\nFollowing the discussion in Section 2 and in line with previous works, we: (i) measure the quality\nof B by A B 2, (ii) sample the entries of A independently, and (iii) require B to be an unbiased\nestimator of A. We are therefore left with the task of determining a good probability distribution pij\nfrom which to sample the entries of A in order to get B. As discussed in Section 2 prior art makes\nheavy use of beautiful results in the theory of random matrices. Speci\ufb01cally, each work proposes a\nspeci\ufb01c sampling distribution and then uses results from random matrix theory to demonstrate that it\nhas good properties. In this work we reverse the approach, aiming for its logical conclusion. 
We start\nfrom a cornerstone result in random matrix theory and work backwards to reverse-engineer near-\noptimal distributions with respect to the notion of probabilistic deviations captured by the inequality.\nThe inequality we use is the Matrix-Bernstein inequality for sums of independent random matrices\n(see e.g., [Tro12b], Theorem 1.6). In the following, we often denote A 2 as A to lighten notation.\nTheorem 3.1 (Matrix Bernstein inequality). Consider a \ufb01nite sequence Xi of i.i.d. random m n\nmatrices, where E X1\nFor some \ufb01xed s\n\nR. Let 2 max E X1X T\n1\n\nXs s. For all \"\n\n0 and X1\n\n, E X T\n\n1, let X\n\n1 X1\n\n.\n\nX1\n\n0,\n\nPr X\n\n\"\n\nm n exp\n\ns\"2\n\n2 R\" 3\n\n.\n\nTo get a feeling for our approach, \ufb01x any probability distribution p over the non-zero elements of\nA. Let B be a random m n matrix with exactly one non-zero element, formed by sampling an\nelement Aij of A according to p and letting Bij Aij pij. Observe that for every i, j , regardless\nof the choice of p, we have E Bij\nAij, and thus B is always an unbiased estimator of A. Clearly,\nthe same is true if we repeat this s times, taking i.i.d. samples B1, . . . , Bs, and let our matrix B\nbe their average. With this approach in mind, the goal is now to \ufb01nd a distribution p minimizing\nA Bs we see that sE is the\nE\noperator norm of a sum of i.i.d. zero-mean random matrices Xi A Bi, i.e., exactly the setting\n\nBs s . Writing sE\n\nA B1\n\nA B1\n\n3\n\n\fof Theorem 3.1. The relevant parameters are\n\n2\nR\n\nmax E A B1 A B1\nmax A B1\n\nT , E A B1\n\nT A B1\n\nover all possible realizations of B1 .\n\n(1)\n(2)\n\nEquations (1) and (2) mark the starting point of our work. Our goal is to \ufb01nd probability distributions\nover the elements of A that optimize (1) and (2) simultaneously with respect to their functional form\nin Theorem 3.1, thus yielding the strongest possible bound on A B . 
A conceptual contribution\nof our work is the discovery that good distributions depend on the sample budget s, a fact also borne\nout in experiments. The fact that minimizing the deviation metric of Theorem 3.1, i.e., 2 R\u270f 3,\nsuf\ufb01ces to bring out this dependence can be viewed as testament to the theorem\u2019s sharpness.\nTheorem 3.1 is stated as a bound on the probability that the norm of the error matrix is greater than\nsome target error \" given the number of samples s. In practice, the target error \" is typically not\nknown in advance, but rather is the quantity to minimize, given the matrix A, the number of samples\ns, and the target con\ufb01dence . Speci\ufb01cally, for any given distribution p on the elements of A, de\ufb01ne\n\n\"1 p\n\ninf\n\n\" : m n exp\n\ns\"2\n\n p 2 R p \" 3\n\n\n\n.\n\n(3)\n\nOur goal in the rest of the paper is to seek the distribution p minimizing \"1. Our result is an easily\ncomputable distribution p which comes within a factor of 3 of \"1 p\nand, as a result, within a factor\nof 9 in terms of sample complexity (in practice we expect this to be even smaller, as the factor of\n3 comes from consolidating bounds for a number of different worst-case matrices). To put this in\nperspective note that the de\ufb01nition of p does not place any restriction either on the access model\nfor A while computing p , or on the amount of time needed to compute p . In other words, we are\ncompeting against an oracle which in order to determine p has all of A in its purview at once and\ncan spend an unbounded amount of computation to determine it.\nIn contrast, the only global information regarding A we require are the ratios between the L1 norms\nof the rows of the matrix. Trivially, the exact L1 norms of the rows (and therefore their ratios) can\nbe computed in a single pass over the matrix, yielding a 2-pass algorithm. 
Slightly less trivially, standard concentration arguments imply that these ratios can be estimated very well by sampling only a small number of columns. In the setting of data analysis, though, it is in fact reasonable to expect that good estimates of these ratios are available a priori. This is because different rows correspond to different attributes and the ratios between the row norms reflect the ratios between the average absolute values of the features. For example, if the matrix corresponds to text documents, knowing the ratios amounts to knowing global word frequencies. Moreover, these ratios do not need to be known exactly to apply the algorithm, as even rough estimates of them give highly competitive results. Indeed, even disregarding this issue completely and simply assuming that all ratios equal 1 yields an algorithm that appears quite competitive in practice, as demonstrated by our experiments.

4 Data Matrices and Statement of Results

Throughout, A_i and A^j will denote the i-th row and j-th column of A, respectively. Also, we use the notation ‖A‖₁ = Σ_{i,j} |A_ij| and ‖A‖_F² = Σ_{i,j} A_ij². Before we formally state our result we introduce a definition that expresses the class of matrices for which our results hold.

Definition 4.1. An m × n matrix A is a Data matrix if:
1. minᵢ ‖A_i‖₁ ≥ maxⱼ ‖A^j‖₁.
2. ‖A‖₁² / ‖A‖₂² ≥ 30m.
3. m ≥ 30.

Regarding Condition 1, recall that we think of A as being generated by a measurement process over a fixed number of attributes (rows), each column corresponding to an observation. As a result, columns have bounded L1 norm, i.e., ‖A^j‖₁ ≤ constant. While this constant may depend on the type of object and its dimensionality, it is independent of the number of objects. On the other hand, ‖A_i‖₁ grows linearly with the number of columns (objects). As a result, we can expect Definition 4.1 to hold for all large enough data sets.
Regarding Condition 2, it is easy to verify that unless the values of the entries of A exhibit unbounded variance as n grows, the ratio ‖A‖₁²/‖A‖₂² grows as Ω(n), and Condition 2 follows from n ≫ m. Condition 3 is trivial. All in all, out of the three conditions the essential one is Condition 1. The other two are merely technical and hold in all non-trivial cases where Condition 1 applies.

One last point is that to apply Theorem 3.1, the entries of A must be sampled with replacement. A simple way to achieve this in the streaming model was presented in [DKM06]; it uses O(s) operations per matrix element and O(s) active memory. In Section D (see supplementary material) we discuss how to implement sampling with replacement far more efficiently, using O(log s) active memory, Õ(s) space, and O(1) operations per element. To simplify the exposition of our algorithm, below, we describe it in the non-streaming setting. That is, we assume we know m and n, that we can compute the numbers z_i = ‖A_i‖₁, and that we can repeatedly sample entries from the matrix.
We stress, however, that these conditions are not required and that the algorithm can be implemented efficiently in the streaming model as discussed in Section D.

Algorithm 1 Construct a sketch B of a data matrix A
1: Input: Data matrix A ∈ R^{m×n}, sampling budget s, acceptable failure probability δ
2: Set ρ ← COMPUTEROWDISTRIBUTION(A, s, δ)
3: Sample s elements of A with replacement, each A_ij having probability p_ij = ρ_i · |A_ij| / ‖A_i‖₁
4: For each sample ℓ = (i, j, A_ij), let B_ℓ be the matrix with (B_ℓ)_{i,j} = A_ij / p_ij and zero elsewhere.
5: Output: B = (1/s) Σ_{ℓ=1}^{s} B_ℓ.

6: function COMPUTEROWDISTRIBUTION(A, s, δ)
7:   Obtain z such that z_i = ‖A_i‖₁ for i ∈ [m]
8:   Set α = √(log((m+n)/δ)/s) and β = log((m+n)/δ)/(3s)
9:   Define ρ̃_i(ζ) = ( αz_i/(2ζ) + √( (αz_i/(2ζ))² + βz_i/ζ ) )²
10:  Find ζ₁ such that Σ_{i=1}^{m} ρ̃_i(ζ₁) = 1
11:  return ρ such that ρ_i = ρ̃_i(ζ₁) for i ∈ [m]

Steps 6–11 compute a distribution ρ over the rows. Assuming step 7 can be implemented efficiently (or skipped altogether when the z_i are known a priori), the running time of COMPUTEROWDISTRIBUTION is independent of n. Specifically, finding ζ₁ in step 10 can be done efficiently by binary search because the function Σ_i ρ̃_i(ζ) is strictly decreasing in ζ. Conceptually, the probability assigned to each element A_ij in step 3 is simply the probability ρ_i of its row times its intra-row weight |A_ij|/‖A_i‖₁.

We are now able to state our main result. We defer its proof to Section 5 and subsequent details to the appendices (see supplementary material).

Theorem 4.2. If A is a Data matrix per Definition 4.1 and p is the probability distribution defined in Algorithm 1, then ε₁(p) ≤ 3 ε₁(p*), where p* is the minimizer of ε₁.

To compare our result with previous ones we first define several matrix metrics.
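The row-distribution computation in Algorithm 1 reduces to a one-dimensional search, which can be sketched as follows. The expressions for α, β, and ρ̃_i(ζ) below are our reading of steps 8–9 (α = √(log((m+n)/δ)/s), β = log((m+n)/δ)/(3s)), so treat this as a sketch under those assumptions rather than a definitive implementation; ζ₁ is found by bisection, using the fact that Σ_i ρ̃_i(ζ) is strictly decreasing in ζ.

```python
import math

def compute_row_distribution(z, s, delta, m, n, tol=1e-12):
    """Sketch of COMPUTEROWDISTRIBUTION: given z_i ~ ||A_i||_1, find zeta_1
    with sum_i rho_i(zeta_1) = 1 by bisection (the sum is strictly
    decreasing in zeta) and return rho evaluated there."""
    alpha = math.sqrt(math.log((m + n) / delta) / s)
    beta = math.log((m + n) / delta) / (3.0 * s)

    def rho(zeta):
        return [(alpha * zi / (2 * zeta)
                 + math.sqrt((alpha * zi / (2 * zeta)) ** 2 + beta * zi / zeta)) ** 2
                for zi in z]

    lo, hi = tol, 1.0
    while sum(rho(hi)) > 1.0:   # expand until the (decreasing) sum drops below 1
        hi *= 2.0
    while hi - lo > tol * hi:   # standard bisection on the monotone sum
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if sum(rho(mid)) > 1.0 else (lo, mid)
    r = rho(hi)
    t = sum(r)                  # renormalize away the bisection tolerance
    return [x / t for x in r]
```

Note that the routine touches only the m row norms, so its cost is independent of n, matching the discussion above.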
We then state the bound implied by Theorem 4.2 on the minimal number of samples s₀ needed by our algorithm to achieve an approximation B of the matrix A such that ‖A − B‖ ≤ ε‖A‖ with constant probability.

Stable rank: Denoted sr and defined as ‖A‖_F² / ‖A‖₂². This is a smooth analog of the algebraic rank, always bounded by it from above, and resilient to small perturbations of the matrix. For data matrices we expect it to be small, even constant, and to capture the "inherent dimensionality" of the data.

Numeric density: Denoted nd and defined as ‖A‖₁² / ‖A‖_F²; this is a smooth analog of the number of non-zero entries nnz(A). For 0-1 matrices it equals nnz(A), but when there is variance in the magnitude of the entries it is smaller.

Numeric row density: Denoted nrd and defined as Σᵢ ‖A_i‖₁² / (‖A‖_F² · n). In practice, it is often close to the average numeric density of a single row, a quantity typically much smaller than n.

Theorem 4.3. Let A be a Data Matrix per Definition 4.1 and let B be the matrix returned by Algorithm 1 for δ = 1/10, any ε > 0, and any

s ≥ s₀ = Θ( nrd · sr · ε⁻² · log n + ( sr · nd · ε⁻² · log n )^{1/2} ).

With probability at least 9/10, ‖A − B‖ ≤ ε‖A‖.

The proof of Theorem 4.3 is given in Appendix C (see supplementary material).

The third column of the table below shows the number of samples needed to guarantee that ‖A − B‖ ≤ ε‖A‖ occurs with constant probability, in terms of the matrix metrics defined above. The fourth column presents the ratio of the number of samples needed by previous results divided by the number needed by our method. (To simplify the expressions, we present the ratio between our bound and [AHK06] only when the result of [AHK06] gives bounds superior to [DZ11], i.e., we always compare our bound to the stronger of the two bounds implied by these works.)
Holding \" and the\nstable rank constant we readily see that our method requires roughly 1\nn the samples needed\nby [AHK06]. In the comparison with [DZ11] we see that the key parameter is the ratio nrd n, a\nquantity typically much smaller than 1 for data matrices. As a point of reference for the assumptions,\nin the experimental Section 6 we provide the values of all relevant matrix metrics for all the real data\nmatrices we worked with, wherein the ratio nrd n is typically around 10 2. By this discussion, one\nwould expect that L2-sampling should fare better than L1-sampling in experiments. As we will see,\nquite the opposite is true. A potential explanation for this phenomenon is the relative looseness of\nthe bound of [AHK06] for the performance of L1-sampling.\n\nMethod\nL1, L2\n\nNumber of samples needed\nsr n \"2\nn polylog n\n\nImprovement ratio of Theorem 4.3\n\nsr n \"2 log n\nnd n \"2 1 2\nnrd sr \"2 log n\nsr nd \"2 log n 1 2\n\nnrd n\n\nnd n\n\n\"\n\nsr log n\n\nsr log n n\n\nCitation\n[AM07]\n[DZ11]\n[AHK06]\n\nL2\nL1\n\nThis paper\n\nBernstein\n\n5 Proof outline\n\nWe start by iteratively replacing the objective functions (1) and (2) with simpler and simpler func-\ntions. Each replacement will incur a (small) loss in accuracy but will bring us closer to a function\nfor which we can give a closed form solution. Recalling the de\ufb01nitions of \u21b5, from Algorithm 1\nand rewriting the requirement in (3) as a quadratic form in \" gives \"2\n0. Our \ufb01rst\n\"R\n0 has one negative and one\nstep is to observe that for any c, d\nd\npositive solution and that the latter is at least c\nd. Therefore, if we\nde\ufb01ne2 \"2 : \u21b5\nOur next simpli\ufb01cation encompasses Conditions 2, 3 of De\ufb01nition 4.1. 
Let \"3 : \u21b5\u02dc\n\n0, the equation \"2\nd\n1.\n\n\" c\n2 and at most c\n\nR we see that 1\n\n \u02dcR where\n\n\u21b5 2\n\n2\n\n\"1 \"2\n\n\u02dc2 : max max\n\ni\n\nj\n\nA2\n\nij pij , max\n\nj\n\nA2\n\nij pij\n\ni\n\nand\n\n\u02dcR : max\nij\n\nAij pij .\n\nLemma 5.1. For every matrix A satisfying Conditions 2 and 3 of De\ufb01nition 4.1, for every probability\ndistribution on the elements of A, \"2 \"3\n\n1 30.\n\n1\n\n and \u02dcR R.\nLemma 5.1 is proved in section A (see supplementary material) by showing that \u02dc\nThis allows us to optimize p with respect to \"3 instead of \"2. In minimizing \"3 we see that there is\nfreedom to use different rows to optimize \u02dc and \u02dcR. At a cost of a factor of 2, we will couple the two\n\n2Here and in the following, to lighten notation, we will omit all arguments, i.e., p, p , R p , from the\n\nobjective functions \"i we seeks to optimize, as they are readily understood from context.\n\n6\n\n\fminimizations by minimizing \"4 max \"5,\" 6 where\n\n\"5 : max\n\ni\n\n\u21b5\n\nA2\nij\npij\n\nj\n\n max\n\nj\n\nAij\npij\n\n,\"\n\n6 : max\n\nj\n\n\u21b5\n\nA2\nij\npij\n\ni\n\n max\n\ni\n\nAij\npij\n\n.\n\n(4)\n\nNote that the maximization of \u02dcR in \"5 (and \"6) is coupled with that of the \u02dc-related term by con-\nstraining the optimization to consider only one row (column) at a time. Clearly, 1\nNext we focus on \"5, the \ufb01rst term in the maximization of \"4. The following key lemma establishes\nthat for all data matrices satisfying Condition 1 of De\ufb01nition 4.1, by minimizing \"5 we also minimize\n\"4 max \"5,\" 6 .\nLemma 5.2. For every matrix satisfying Condition 1 of De\ufb01nition 4.1, argminp \"5\nAt this point we can derive in closed form the probability distribution p minimizing \"5.\nLemma 5.3. The function \"5 is minimized by pij\n\nAij A i 1. 
To de\ufb01ne \u21e2i\n\n\u21e2iqij where qij\n\nargminp \"4.\n\n\"3 \"4\n\n2.\n\nlet zi A i 1 and de\ufb01ne \u21e2i \u21e3\nsolution to3\n\ni \u21e2i \u21e31\n\n1. Let \u21e2i :\n\n\u21b5zi 2\u21e3 2\n\n\u21b5zi 2\u21e3\n\u21e2i \u21e31 .\n\n2\n\nzi \u21e3\n\n. Let \u21e31\n\n0 be the unique\n\nTo prove Theorem 4.2 we see that Lemmas 5.2 and 5.3 combined imply that there is an ef\ufb01cient\nalgorithm for minimizing \"4 for every matrix A satisfying Condition 1 of De\ufb01nition 4.1. If A also\nsatis\ufb01es Conditions 2 and 3 of De\ufb01nition 4.1, then it is possible to lower and upper bound the ratios\n\"1 \"2, \"2 \"3 and \"3 \"4. Combined, these bounds guarantee a lower and upper bound for \"1 \"4.\nIn general, if c\nC c min \"1 . Thus,\ncalculating the constants shows \"1 arg min \"4\n\nC we can conclude that \"1 arg min \"4\n\n3 min \"1 , yielding Theorem 4.3.\n\n\"4 \"1\n\n6 Experiments\n\nWe experimented with 4 matrices with different characteristics, summarized in the table below. See\nSection 4 for the de\ufb01nition of the different characteristics.\n\nMeasure\nSynthetic\n\nEnron\nImages\nWikipedia\n\nm\n\n1.0e+2\n1.3e+4\n5.1e+3\n4.4e+5\n\nn\n\n1.0e+4\n1.8e+5\n4.9e+5\n3.4e+6\n\nnnz A\n5.0e+5\n7.2e+5\n2.5e+8\n5.3e+8\n\nA 1\n1.8e+7\n4.0e+9\n6.5e+9\n5.3e+9\n\nA F\n3.2e+4\n5.8e+6\n2.0e+6\n7.5e+5\n\nA 2\n8.7e+3\n1.0e+6\n1.8e+6\n1.6e+5\n\nsr\n\n1.3e+1\n3.2e+1\n1.3e+0\n2.1e+1\n\nnd\n\n3.1e+5\n4.9e+5\n1.1e+7\n5.0e+7\n\nnrd\n3.2e+3\n1.5e+3\n2.3e+3\n1.9e+4\n\nEnron: Subject lines of emails in the Enron email corpus [Sty11]. Columns correspond to subject\nlines, rows to words, and entries to tf-idf values. This matrix is extremely sparse to begin with.\nWikipedia: Term-document matrix of a fragment of Wikipedia in English. Entries are tf-idf values.\nImages: A collection of images of buildings from Oxford [PCI 07]. Each column represents the\nwavelet transform of a single 128\nSynthetic: This synthetic matrix simulates a collaborative \ufb01ltering matrix. 
Each row corresponds to\nan item and each column to a user. Each user and each item was \ufb01rst assigned a random latent vector\n(i.i.d. Gaussian). Each value in the matrix is the dot product of the corresponding latent vectors plus\nadditional Gaussian noise. We simulated the fact that some items are more popular than others by\nretaining each entry of each item i with probability 1\n\n128 pixel grayscale image.\n\n0, . . . , m 1.\n\ni m where i\n\n6.1 Sampling techniques and quality measure\n\nThe experiments report the accuracy of sampling according to four different distributions. In Fig-\nure 6.1, Bernstein denotes the distribution of this paper, de\ufb01ned in Lemma 5.3. The Row-L1\nA i 1. L1 and\ndistribution is a simpli\ufb01ed version of the Bernstein distribution, where pij Aij\n2, respectively, as de\ufb01ned earlier in the paper. The case of L2\nL2 refer to pij Aij and pij Aij\n\n3Notice that the function\n\n\u21e2i \u21e3 is monotonically decreasing for \u21e3\n\n0 hence the solution is indeed unique.\n\n7\n\n\f2\n\n0.1 Eij Aij\n\n2 and pij\n\n2 .\n\n2 for any entry where Aij\n\nsampling was split into three sampling methods corresponding to different trimming thresholds. In\n2. In the case referred to as L2 trim\nthe method referred to as L2 no trimming is made and pij Aij\n0.1, pij Aij\n0 otherwise. The sampling\ntechnique referred to as L2 trim 0.01 is analogous with threshold 0.01 Eij Aij\nAlthough to derive our sampling probability distributions we targeted minimizing A B 2, in\nexperiments it is more informative to consider a more sensitive measure of quality of approximation.\nThe reason is that for a number of values of s, the scaling of entries required for B to be an unbiased\nestimator of A, results in A B\nA which would suggest that the all zeros matrix is a\nbetter sketch for A than the sampled matrix. We will see that this is far from being the case. As\n10A. 
As a trivial example, consider the possibility B = A/10. Clearly, B is very informative of A although ‖A − B‖ = 0.9‖A‖. To avoid this pitfall, we measure ‖P_k^B A‖_F / ‖A_k‖_F, where P_k^B is the projection on the top k left singular vectors of B. Thus, A_k = P_k^A A is the optimal rank-k approximation of A. Intuitively, this measures how well the top k left singular vectors of B capture A, compared to A's own (optimal) top-k left singular vectors. We also compute ‖A Q_k^B‖_F / ‖A_k‖_F, where Q_k^B is the projection on the top k right singular vectors of B. Note that, for a given k, approximating the row-space is harder than approximating the column-space since it is of dimension n, which is significantly larger than m, a fact also borne out in the experiments. In the experiments we made sure to choose a sufficiently wide range of sample sizes so that at least the best method for each matrix goes from poor to near-perfect both in approximating the row and the column space. In all cases we report on k = 20, which is close to the upper end of what could be efficiently computed on a single machine for matrices of this size. The results for all smaller values of k are qualitatively indistinguishable.

Figure 1: Each vertical pair of plots corresponds to one matrix. Left to right: Wikipedia, Images, Enron, Synthetic.
Each top plot shows the column-space approximation ratio, ||P^B_k A||_F / ||A_k||_F, while the bottom plots show the row-space approximation ratio, ||A Q^B_k||_F / ||A_k||_F. The number of samples s is on the x-axis in log scale, x = log10 s.

6.2 Insights

The experiments demonstrate three main insights. First and most important, Bernstein-sampling is never worse than any of the other techniques and is often strictly better. A dramatic example of this is the Wikipedia matrix, for which it is far superior to all other methods. The second insight is that L1-sampling, i.e., simply taking p_ij = |A_ij| / ||A||_1, performs rather well in many cases. Hence, if it is impossible to perform more than one pass over the matrix and one cannot even obtain an estimate of the ratios of the L1-weights of the rows, L1-sampling seems to be a highly viable option. The third insight is that for L2-sampling, discarding small entries may drastically improve performance. However, it is not clear which threshold should be chosen in advance. In any case, on all of the example matrices, both L1-sampling and Bernstein-sampling matched or outperformed L2-sampling, even with the correct trimming threshold.

References

[AHK05] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 339-348. IEEE, 2005.

[AHK06] Sanjeev Arora, Elad Hazan, and Satyen Kale. A fast random sampling algorithm for sparsifying matrices. In Proceedings of the 9th International Conference on Approximation Algorithms for Combinatorial Optimization Problems, and 10th International Conference on Randomization and Computation, APPROX'06/RANDOM'06, pages 272-279, Berlin, Heidelberg, 2006. Springer-Verlag.

[AKV02] Noga Alon, Michael Krivelevich, and Van H. Vu.
On the concentration of eigenvalues of random symmetric matrices. Israel Journal of Mathematics, 131:259-267, 2002.

[AM01] Dimitris Achlioptas and Frank McSherry. Fast computation of low rank matrix approximations. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 611-618. ACM, 2001.

[AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. J. ACM, 54(2), April 2007.

[AW02] Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569-579, 2002.

[Ber07] Aleš Berkopec. HyperQuick algorithm for discrete hypergeometric distribution. Journal of Discrete Algorithms, 5(2):341-347, 2007.

[CR09] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

[CT10] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.

[d'A08] Alexandre d'Aspremont. Subsampling algorithms for semidefinite programming. arXiv preprint arXiv:0803.1990, 2008.

[DKM06] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM J. Comput., 36(1):132-157, July 2006.

[DZ11] Petros Drineas and Anastasios Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Inf. Process. Lett., 111(8):385-389, 2011.

[FK81] Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233-241, 1981.

[GT09] Alex Gittens and Joel A. Tropp. Error bounds for random matrix approximation schemes. arXiv preprint arXiv:0911.4108, 2009.

[Juh81] F. Juhász.
On the spectrum of a random graph. In Algebraic Methods in Graph Theory, Vol. I, II (Szeged, 1978), volume 25 of Colloq. Math. Soc. János Bolyai, pages 313-316. North-Holland, Amsterdam, 1981.

[NDT09] Nam H. Nguyen, Petros Drineas, and Trac D. Tran. Matrix sparsification via the Khintchine inequality, 2009.

[NDT10] Nam H. Nguyen, Petros Drineas, and Trac D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. arXiv preprint arXiv:1005.4732, 2010.

[PCI+07] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[Rec11] Benjamin Recht. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413-3430, December 2011.

[RV07] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4), July 2007.

[Sty11] Will Styler. The EnronSent corpus. Technical Report 01-2011, University of Colorado at Boulder Institute of Cognitive Science, Boulder, CO, 2011.

[Tro12a] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

[Tro12b] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

[Wig58] Eugene P. Wigner. On the distribution of the roots of certain symmetric matrices. Annals of Mathematics, 67(2):325-327, 1958.