{"title": "large scale canonical correlation analysis with iterative least squares", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 99, "abstract": "Canonical Correlation Analysis (CCA) is a widely used statistical tool with both well established theory and favorable performance for a wide range of machine learning problems. However, computing CCA for huge datasets can be very slow since it involves implementing QR decomposition or singular value decomposition of huge matrices. In this paper we introduce L-CCA, an iterative algorithm which can compute CCA fast on huge sparse datasets. Theory on both the asymptotic convergence and finite time accuracy of L-CCA is established. The experiments also show that L-CCA outperforms other fast CCA approximation schemes on two real datasets.", "full_text": "Large Scale Canonical Correlation Analysis with Iterative Least Squares\n\nYichao Lu\nUniversity of Pennsylvania\nyichaolu@wharton.upenn.edu\n\nDean P. Foster\nYahoo Labs, NYC\ndean@foster.net\n\nAbstract\n\nCanonical Correlation Analysis (CCA) is a widely used statistical tool with both well established theory and favorable performance for a wide range of machine learning problems. However, computing CCA for huge datasets can be very slow since it involves implementing QR decomposition or singular value decomposition of huge matrices. In this paper we introduce L-CCA, an iterative algorithm which can compute CCA fast on huge sparse datasets. Theory on both the asymptotic convergence and finite time accuracy of L-CCA is established. The experiments also show that L-CCA outperforms other fast CCA approximation schemes on two real datasets.\n\n1 Introduction\n\nCanonical Correlation Analysis (CCA) is a widely used spectral method for finding correlation structures in multi-view datasets, introduced by [15]. 
Recently, [3, 9, 17] proved that CCA is able to find the right latent structure under certain hidden state models. For modern machine learning problems, CCA has already been successfully used as a dimensionality reduction technique for the multi-view setting. For example, a CCA between the text description and image of the same object will find common structures between the two different views, which generates a natural vector representation of the object. In [9], CCA is performed on a large unlabeled dataset in order to generate low dimensional features for a regression problem where the size of the labeled dataset is small. In [6, 7] a CCA between words and their context is implemented on several large corpora to generate low dimensional vector representations of words which capture useful semantic features.\n\nWhen the data matrices are small, the classical algorithm for computing CCA involves first a QR decomposition of the data matrices, which pre-whitens the data, and then a Singular Value Decomposition (SVD) of the whitened covariance matrix, as introduced in [11]. This is exactly how Matlab computes CCA. But for huge datasets this procedure becomes extremely slow. For data matrices with huge sample size, [2] proposed a fast CCA approach based on a fast inner product preserving random projection called the Subsampled Randomized Hadamard Transform, but it's still slow for datasets with a huge number of features. In this paper we introduce a fast algorithm for finding the top kcca canonical variables from huge sparse data matrices (a single multiplication with these sparse matrices is very fast) X ∈ n × p1 and Y ∈ n × p2, the rows of which are i.i.d samples from a pair of random vectors. Here n ≥ p1, p2 ≫ 1 and kcca is a relatively small number like 50, since the primary goal of CCA is to generate low dimensional features. 
Under this set up, QR decomposition of an n × p matrix costs O(np^2), which is extremely slow even if the matrix is sparse. On the other hand, since the data matrices are sparse, X^T X and Y^T Y can be computed very fast. So another whitening strategy is to compute (X^T X)^{-1/2} and (Y^T Y)^{-1/2}. But when p1, p2 are large this takes O(max{p1^3, p2^3}), which is both slow and numerically unstable.\n\nThe main contribution of this paper is a fast iterative algorithm L-CCA consisting of only QR decompositions of relatively small matrices and a couple of matrix multiplications which involve only huge sparse matrices or small dense matrices. This is achieved by reducing the computation of CCA to a sequence of fast Least Square iterations. It is proved that L-CCA asymptotically converges to the exact CCA solution, and an error analysis for finite iterations is also provided. As shown by the experiments, L-CCA also has favorable performance on real datasets when compared with other CCA approximations given a fixed CPU time.\n\nIt's worth pointing out that approximating CCA is much more challenging than SVD (or PCA). As suggested by [12, 13], to approximate the top singular vectors of X it suffices to randomly sample a small subspace in the span of X, and some power iterations with this small subspace will automatically converge to the directions with top singular values. On the other hand, CCA has to search through the whole X, Y span in order to capture directions with large correlation. For example, when the most correlated directions happen to live in the bottom singular vectors of the data matrices, the random sampling scheme will miss them completely. 
On the other hand, what the L-CCA algorithm intuitively does is run an exact search for correlation structures on the top singular vectors and a fast gradient based approximation on the remaining directions.\n\n2 Background: Canonical Correlation Analysis\n\n2.1 Definition\n\nCanonical Correlation Analysis (CCA) can be defined in many different ways. Here we use the definition in [9, 17], since this version naturally connects CCA with the Singular Value Decomposition (SVD) of the whitened covariance matrix, which is the key to understanding our algorithm.\n\nDefinition 1. Let X ∈ n × p1 and Y ∈ n × p2, where the rows are i.i.d samples from a pair of random vectors. Let Φx ∈ p1 × p1, Φy ∈ p2 × p2, and use φx,i, φy,j to denote the columns of Φx, Φy respectively. Xφx,i, Yφy,j are called canonical variables if\n\nφx,i^T X^T X φx,j = 1 if i = j, 0 if i ≠ j;  φy,i^T Y^T Y φy,j = 1 if i = j, 0 if i ≠ j;  φx,i^T X^T Y φy,j = di if i = j, 0 if i ≠ j.\n\nXφx,i, Yφy,i is the ith pair of canonical variables and di is the ith canonical correlation.\n\n2.2 CCA and SVD\n\nFirst we introduce some notation. Let\n\nCxx = X^T X, Cyy = Y^T Y, Cxy = X^T Y.\n\nFor simplicity assume Cxx and Cyy are full rank, and let\n\nC~xy = Cxx^{-1/2} Cxy Cyy^{-1/2}.\n\nThe following lemma provides a way to compute the canonical variables by SVD.\n\nLemma 1. Let C~xy = U D V^T be the SVD of C~xy, where ui, vj denote the left and right singular vectors and di denotes the singular values. Then X Cxx^{-1/2} ui, Y Cyy^{-1/2} vj are the canonical variables of the X, Y space respectively.\n\nProof. Plugging X Cxx^{-1/2} ui, Y Cyy^{-1/2} vj into the equations in Definition 1 directly proves Lemma 1.\n\nAs mentioned before, we are interested in computing the top kcca canonical variables where kcca ≪ p1, p2. Use U1, V1 to denote the first kcca columns of U, V respectively and use U2, V2 for the remaining columns. 
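Lemma 1 translates directly into a small dense computation. The following NumPy sketch is our own illustration, not the paper's implementation; it forms the inverse square roots Cxx^{-1/2}, Cyy^{-1/2} by eigendecomposition, which is only viable for small p1, p2:

```python
import numpy as np

def exact_cca(X, Y, k):
    # Whitened cross-covariance C~xy = Cxx^{-1/2} Cxy Cyy^{-1/2} (section 2.2),
    # then an SVD as in Lemma 1. Dense and cubic in p -- small matrices only.
    def inv_sqrt(C):
        # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return (V / np.sqrt(w)) @ V.T
    Wx = inv_sqrt(X.T @ X)
    Wy = inv_sqrt(Y.T @ Y)
    U, d, Vt = np.linalg.svd(Wx @ (X.T @ Y) @ Wy)
    # canonical variables X Cxx^{-1/2} u_i and Y Cyy^{-1/2} v_i
    return X @ Wx @ U[:, :k], Y @ Wy @ Vt[:k].T, d[:k]
```

By construction the outputs satisfy the three conditions of Definition 1: each block of canonical variables is orthonormal, and their cross products form diag(di).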
By Lemma 1, the top kcca canonical variables can be represented by X Cxx^{-1/2} U1 and Y Cyy^{-1/2} V1.\n\nAlgorithm 1 CCA via Iterative LS\n\nInput: Data matrices X ∈ n × p1, Y ∈ n × p2. A target dimension kcca. Number of orthogonal iterations t1.\nOutput: Xkcca ∈ n × kcca, Ykcca ∈ n × kcca, consisting of the top kcca canonical variables of X and Y.\n1. Generate a p1 × kcca dimensional random matrix G with i.i.d standard normal entries.\n2. Let X0 = XG.\n3. for t = 1 to t1 do\n    Yt = HY Xt-1 where HY = Y(Y^T Y)^{-1} Y^T\n    Xt = HX Yt where HX = X(X^T X)^{-1} X^T\nend for\n4. Xkcca = QR(Xt1), Ykcca = QR(Yt1).\nThe function QR(Xt) extracts an orthonormal basis of the column space of Xt with a QR decomposition.\n\n3 Compute CCA by Iterative Least Squares\n\nSince the top canonical variables are connected with the top singular vectors of C~xy, which can be computed with orthogonal iteration [10] (called simultaneous iteration in [21]), we can also compute CCA iteratively. A detailed algorithm is presented in Algorithm 1.\n\nThe convergence result of Algorithm 1 is stated in the following theorem:\n\nTheorem 1. Assume |d1| > |d2| > |d3| > ... > |dkcca+1| and U1^T Cxx^{1/2} G is non singular (this holds with probability 1 if the elements of G are i.i.d Gaussian). The columns of Xkcca and Ykcca will converge to the top kcca canonical variables of X and Y respectively as t1 → ∞.\n\nTheorem 1 is proved by showing it's essentially an orthogonal iteration [10, 21] for computing the top kcca eigenvectors of A = C~xy C~xy^T. A detailed proof is provided in the supplementary materials.\n\n3.1 A Special Case\n\nWhen X, Y are sparse and Cxx, Cyy are diagonal (like the Penn Tree Bank dataset in the experiments), Algorithm 1 can be implemented extremely fast, since we only need to multiply with sparse matrices or invert huge but diagonal matrices in every iteration. 
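Algorithm 1 can be sketched in NumPy as follows. This is our illustration, not the authors' code: the hat-matrix products HY, HX are written as dense least-squares solves, and (anticipating the numerical-stability point made next) a QR step is applied after every iteration rather than only at the end:

```python
import numpy as np

def cca_iterative_ls(X, Y, k_cca, t1, seed=0):
    # Orthogonal iteration with the projections H_Y, H_X (Algorithm 1).
    rng = np.random.default_rng(seed)
    # steps 1-2: random start X0 = X G, orthonormalized
    Xt = np.linalg.qr(X @ rng.standard_normal((X.shape[1], k_cca)))[0]
    Yt = Xt
    for _ in range(t1):
        # Yt = H_Y X_{t-1}: regress X_{t-1} on Y, then predict
        Yt = np.linalg.qr(Y @ np.linalg.lstsq(Y, Xt, rcond=None)[0])[0]
        # Xt = H_X Yt
        Xt = np.linalg.qr(X @ np.linalg.lstsq(X, Yt, rcond=None)[0])[0]
    return Xt, Yt
```

Since the returned bases are orthonormal, the singular values of Xt.T @ Yt estimate the top k_cca canonical correlations.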
QR decomposition is performed not only at the end but after every iteration for numerical stability (here we only need to QR matrices much smaller than X, Y). We call this fast version D-CCA in the following discussions.\n\nWhen Cxx, Cyy aren't diagonal, computing the matrix inverses becomes very slow. But we can still run D-CCA by approximating (X^T X)^{-1}, (Y^T Y)^{-1} with (diag(X^T X))^{-1}, (diag(Y^T Y))^{-1} in Algorithm 1 when speed is a concern. This leads to poor performance when Cxx, Cyy are far from diagonal, as shown by the URL dataset in the experiments.\n\n3.2 General Case\n\nAlgorithm 1 reduces the problem of CCA to a sequence of iterative least square problems. When X, Y are huge, solving LS exactly is still slow since it involves inverting a huge matrix, but fast LS methods are relatively well studied. There are many ways to approximate the LS solution: by optimization based methods like Gradient Descent [1, 23] and Stochastic Gradient Descent [16, 4], or by random projection and subsampling based methods like [8, 5]. A fast approximation to the top kcca canonical variables can be obtained by replacing the exact LS solution in every iteration of Algorithm 1 with a fast approximation. Here we choose LING [23], which works well for large sparse design matrices, for solving the LS problem in every CCA iteration.\n\nThe connection between CCA and LS has been developed under different setups for different purposes. [20] shows that CCA in the multi label classification setting can be formulated as an LS problem. [22] also formulates CCA as a recursive LS problem and builds an online version based on this observation. The benefit we take from this iterative LS formulation is that running a fast LS approximation in every iteration will give us a fast CCA approximation with both provable theoretical guarantees and favorable experimental performance.\n\nAlgorithm 2 LING\n\nInput: X ∈ n × p, Y ∈ n × 1. kpc, the number of top left singular vectors selected. t2, the number of iterations in Gradient Descent.\nOutput: Y^ ∈ n × 1, which is an approximation to X(X^T X)^{-1} X^T Y.\n1. Compute U1 ∈ n × kpc, the top kpc left singular vectors of X, by randomized SVD (see supplementary materials for a detailed description).\n2. Y1 = U1 U1^T Y.\n3. Compute the residual Yr = Y - Y1.\n4. Use gradient descent initialized at the 0 vector (see supplementary materials for a detailed description) to approximately solve the LS problem min_{βr ∈ R^p} ||X βr - Yr||^2. Use βr,t2 to denote the solution after t2 gradient iterations.\n5. Y^ = Y1 + X βr,t2.\n\n4 Algorithm\n\nIn this section we introduce L-CCA, a fast CCA algorithm based on Algorithm 1.\n\n4.1 LING: a Gradient Based Least Square Algorithm\n\nFirst we need to introduce the fast LS algorithm LING mentioned in section 3.2, which is used in every orthogonal iteration of L-CCA. Consider the LS problem:\n\nβ* = arg min_{β ∈ R^p} ||Xβ - Y||^2\n\nfor X ∈ n × p and Y ∈ n × 1. For simplicity assume X is full rank. Xβ* = X(X^T X)^{-1} X^T Y is the projection of Y onto the column space of X. In this section we introduce a fast algorithm LING to approximately compute Xβ* without forming (X^T X)^{-1} explicitly, which is slow for large p. The intuition of LING is as follows. Let U1 ∈ n × kpc (kpc ≪ p) be the top kpc left singular vectors of X and U2 ∈ n × (p - kpc) be the remaining singular vectors. In LING we decompose Xβ* into two orthogonal components,\n\nXβ* = U1 U1^T Y + U2 U2^T Y,\n\nthe projection of Y onto the span of U1 and the projection onto the span of U2. The first term can be computed fast given U1 since kpc is small. U1 can also be computed fast approximately with the randomized SVD algorithm introduced in [12], which only requires a few fast matrix multiplications and a QR decomposition of an n × kpc matrix. The details for finding U1 are illustrated in the supplementary materials. 
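The two-component split can be sketched as follows. This is our illustration of the idea rather than the paper's implementation: it uses an exact SVD where the paper uses the randomized one, and the constant GD step size 1/σ1² is our assumption (the paper defers its GD details to the supplementary materials):

```python
import numpy as np

def ling(Y, X, k_pc, t2):
    # Split X beta* into the projection onto the top k_pc left singular
    # vectors U1, plus a GD estimate of the projection of the residual.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U1 = U[:, :k_pc]
    Y1 = U1 @ (U1.T @ Y)          # exact part: projection onto span(U1)
    Yr = Y - Y1                   # residual
    beta = np.zeros(X.shape[1])
    step = 1.0 / s[0] ** 2        # assumed fixed step, 1 / sigma_max^2
    for _ in range(t2):           # GD on min ||X beta - Yr||^2, started at 0
        beta -= step * (X.T @ (X @ beta - Yr))
    return Y1 + X @ beta
```

With t2 = 0 this returns only the top-component projection; as t2 grows the output approaches the full projection X(X^T X)^{-1} X^T Y at a geometric rate.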
Let Yr = Y - U1 U1^T Y be the residual of Y after projecting onto U1. For the second term, we compute it by solving the optimization problem\n\nmin_{βr ∈ R^p} ||X βr - Yr||^2\n\nwith Gradient Descent (GD), which is also described in detail in the supplementary materials. A detailed description of LING is presented in Algorithm 2.\n\nIn the above discussion Y is a column vector. It is straightforward to generalize LING to fit into Algorithm 1, where Y has multiple columns, by applying Algorithm 2 to every column of Y. In the following discussions, we use LING(Y, X, kpc, t2) to denote the LING output with the corresponding inputs, which is an approximation to X(X^T X)^{-1} X^T Y.\n\nThe following theorem gives an error bound for LING.\n\nTheorem 2. Use σi to denote the ith singular value of X. Consider the LS problem\n\nmin_{β ∈ R^p} ||Xβ - Y||^2\n\nfor X ∈ n × p and Y ∈ n × 1. Let Y* = X(X^T X)^{-1} X^T Y be the projection of Y onto the column space of X and Y^t2 = LING(Y, X, kpc, t2). Then\n\n||Y* - Y^t2||_2 ≤ C r^{2 t2}    (1)\n\nfor some constant C > 0 and r = (σ_{kpc+1}^2 - σ_p^2)/(σ_{kpc+1}^2 + σ_p^2) < 1.\n\nThe proof is in the supplementary materials due to space limitations.\n\nRemark 1. Theorem 2 gives some intuition for why LING decomposes the projection into two components. In an extreme case, if we set kpc = 0 (i.e. don't remove the projection on the top principal components and directly apply GD to the LS problem), r in equation 1 becomes (σ_1^2 - σ_p^2)/(σ_1^2 + σ_p^2). Usually σ_1 is much larger than σ_p, so r is very close to 1, which makes the error decay slowly. Removing the projections on the kpc top singular vectors accelerates the error decay by making r smaller. The benefit of this trick is easily seen in the experiment section.\n\n4.2 Fast Algorithm for CCA\n\nOur fast CCA algorithm L-CCA is summarized in Algorithm 3.\n\nAlgorithm 3 L-CCA\n\nInput: X ∈ n × p1, Y ∈ n × p2: data matrices.\nkcca: number of top canonical variables we want to extract.\nt1: number of orthogonal iterations.\nkpc: number of top singular vectors for LING.\nt2: number of GD iterations for LING.\nOutput: Xkcca ∈ n × kcca, Ykcca ∈ n × kcca: top kcca canonical variables of X and Y.\n1. Generate a p1 × kcca dimensional random matrix G with i.i.d standard normal entries.\n2. Let X0 = XG, X^0 = QR(X0).\n3. for t = 1 to t1 do\n    Yt = LING(X^t-1, Y, kpc, t2), Y^t = QR(Yt)\n    Xt = LING(Y^t, X, kpc, t2), X^t = QR(Xt)\nend for\n4. Xkcca = X^t1, Ykcca = Y^t1.\n\nThere are two main differences between Algorithms 1 and 3. We use LING to solve the least squares problems approximately for the sake of speed. We also apply a QR decomposition to every LING output for the numerical stability issues mentioned in [21].\n\n4.3 Error Analysis of L-CCA\n\nThis section provides mathematical results on how well the output of the L-CCA algorithm approximates the subspace spanned by the top kcca true canonical variables for finite t1 and t2. Note that the asymptotic convergence of L-CCA as t1, t2 → ∞ has already been stated in Theorem 1. First we need to define the distance between subspaces, as introduced in section 2.6.3 of [10]:\n\nDefinition 2. Assume the matrices are full rank. The distance between the column spaces of matrices W1 ∈ n × k and Z1 ∈ n × k is defined by\n\ndist(W1, Z1) = ||HW1 - HZ1||_2\n\nwhere HW1 = W1(W1^T W1)^{-1} W1^T and HZ1 = Z1(Z1^T Z1)^{-1} Z1^T are projection matrices. Here the matrix norm is the spectral norm. It is easy to see that dist(W1, Z1) = dist(W1 R1, Z1 R2) for any invertible k × k matrices R1, R2.\n\nWe continue to use the notation defined in section 2. Recall that X Cxx^{-1/2} U1 gives the top kcca canonical variables of X. The following theorem bounds the distance between the truth X Cxx^{-1/2} U1 and X^t1, the L-CCA output after finite iterations.\n\nTheorem 3. 
The distance between the subspace spanned by the top kcca canonical variables of X and the subspace returned by L-CCA is bounded by\n\ndist(X^t1, X Cxx^{-1/2} U1) ≤ C1 (d_{kcca+1}/d_{kcca})^{2 t1} + C2 (d_{kcca}^2/(d_{kcca}^2 - d_{kcca+1}^2)) r^{2 t2}\n\nwhere C1, C2 are constants and 0 < r < 1 is introduced in Theorem 2. t1 is the number of power iterations in L-CCA and t2 is the number of gradient iterations for solving every LS problem.\n\nThe proof of Theorem 3 is in the supplementary materials.\n\n5 Experiments\n\nIn this section we compare several fast algorithms for computing CCA on large datasets. First let's introduce the algorithms compared in the experiments.\n\n• RPCCA: Instead of running CCA directly on the high dimensional X, Y, RPCCA computes CCA only between the top krpcca principal components (left singular vectors) of X and Y, where krpcca ≪ p1, p2. For large n, p1, p2, we use the randomized algorithm introduced in [12] for computing the top principal components of X and Y (see supplementary material for details). The tuning parameter that controls the tradeoff between computational cost and accuracy is krpcca. When krpcca is small, RPCCA is fast but fails to capture the correlation structure on the bottom principal components of X and Y. When krpcca grows larger, the principal components capture more structure in the X, Y space but it takes longer to compute them. In the experiments we vary krpcca.\n\n• D-CCA: See section 3.1 for a detailed description. The advantage of D-CCA is that it's extremely fast. In the experiments we iterate 30 times (t1 = 30) to make sure D-CCA achieves convergence. As mentioned earlier, when Cxx and Cyy are far from diagonal, D-CCA becomes inaccurate.\n\n• L-CCA: See Algorithm 3 for a detailed description. 
We \ufb01nd that the accuracy of LING\nin every orthogonal iteration is crucial to \ufb01nding directions with large correlation while a\nsmall t1 suf\ufb01ces. So in the experiments we \ufb01x t1 = 5 and vary t2. In both experiments\nwe \ufb01x kpc = 100 so the top kpc singular vectors of X, Y and every LING iteration can be\ncomputed relatively fast.\n\n\u2022 G-CCA : A special case of Algorithm 3 where kpc is set to 0. I.e.\n\nthe LS projection in\nevery iteration is computed directly by GD. G-CCA does not need to compute top singular\nvectors of X and Y as L-CCA . But by equation 1 and remark 1 GD takes more iterations\nto converge compared with LING . Comparing G-CCA and L-CCA in the experiments\nillustrates the bene\ufb01t of removing the top singular vectors in LING and how this can affect\nthe performance of the CCA algorithm. Same as L-CCA we \ufb01x the number of orthogonal\niterations t1 to be 5 and vary t2, the number of gradient iterations for solving LS.\n\nRPCCA , L-CCA , G-CCA are all \"asymptotically correct\" algorithms in the sense that if we\nspend in\ufb01nite CPU time all three algorithms will provide the exact CCA solution while D-CCA is\nextremely fast but relies on the assumption that X Y both have orthogonal columns. Intuitively,\ngiven a \ufb01xed CPU time, RPCCA dose an exact search on krpcca top principle components of X\nand Y. L-CCA does an exact search on the top kpc principle components (kpc < krpcca) and an\ncrude search over the other directions. G-CCA dose a crude search over all the directions. The\ncomparison is in fact testing which strategy is the most effective in \ufb01nding large correlations over\nhuge datasets.\nRemark 2. Both RPCCA and G-CCA can be regarded as special cases of L-CCA . When t1 is\nlarge and t2 is 0, L-CCA becomes RPCCA and when kpc is 0 L-CCA becomes G-CCA .\n\nIn the following experiments we aims at extracting 20 most correlated directions from huge data\nmatrices X and Y. 
The output of each of the above four algorithms is a pair of n × 20 matrices Xkcca and Ykcca whose columns contain the most correlated directions. Then a CCA is performed between Xkcca and Ykcca with the Matlab built-in CCA function. The canonical correlations between Xkcca and Ykcca indicate the amount of correlation captured from the huge X, Y spaces by the four algorithms. In all the experiments, we vary krpcca for RPCCA and t2 for L-CCA and G-CCA to make sure these three algorithms spend almost the same CPU time (D-CCA is always the fastest). The 20 canonical correlations between the subspaces returned by the four algorithms are plotted (larger means better).\n\nWe want to make two additional comments here based on the reviewers' feedback. First, for the two datasets considered in the experiments, classical CCA algorithms like the Matlab built-in function take more than an hour, while our algorithm is able to get an approximate answer in less than 10 minutes. Second, in the experiments we've been focusing on getting a good fit on the training datasets, and the performance is evaluated by the magnitude of correlation captured in sample. To achieve better generalization performance, a common trick is to perform regularized CCA [14], which easily fits into our framework since it's equivalent to running iterative ridge regression instead of OLS in Algorithm 1. Since our goal is to compute a fast and accurate fit, we don't pursue the generalization performance here, which is a separate statistical issue.\n\n5.1 Penn Tree Bank Word Co-occurrence\n\nCCA has already been successfully applied to building low dimensional word embeddings in [6, 7]. So the first task is a CCA between words and their context. The dataset used is the full Wall Street Journal part of the Penn Tree Bank, which consists of 1.17 million tokens and a vocabulary size of 43k [18]. 
The rows of the X matrix consist of indicator vectors of the current word and the rows of Y consist of indicators of the word after. To avoid sample sparsity for Y we only consider the 3000 most frequent words, i.e. we only consider the tokens followed by the 3000 most frequent words, which is about 1 million tokens. So X is of size 1000k × 43k and Y is of size 1000k × 3k, where both X and Y are very sparse. Note that every row of X and Y has only a single 1, since they are indicators of words. So in this case Cxx, Cyy are diagonal, and D-CCA can compute a very accurate CCA in less than a minute, as mentioned in section 3.1. On the other hand, even though this dataset can be solved efficiently by D-CCA, it is interesting to look at the behavior of the other three algorithms, which do not make use of the special structure of this problem, and compare them with D-CCA, which can be regarded as the truth in this particular case. For RPCCA, L-CCA and G-CCA we try three different parameter set ups, shown in Table 1, and the 20 correlations are shown in Figure 1. Among the three algorithms, L-CCA performs best and gets pretty close to D-CCA as CPU time increases. RPCCA doesn't perform well since a lot of the correlation structure of word co-occurrence exists in low frequency words, which can't be captured in the top principal components of X, Y. Since the most frequent word occurs 60k times and the least frequent words occur only once, the spectrum of X drops quickly, which makes GD converge very slowly. 
So G-CCA doesn't perform well either.\n\nTable 1: Parameter setup for the two real datasets. For each run (id), krpcca is the number of principal components used by RPCCA, t2 is the number of gradient iterations used by L-CCA and G-CCA respectively, and CPU time is in seconds.\n\nPTB word co-occurrence:\nid 1: krpcca 300, t2 (L-CCA) 17, t2 (G-CCA) 7, CPU time 170\nid 2: krpcca 500, t2 (L-CCA) 51, t2 (G-CCA) 38, CPU time 460\nid 3: krpcca 800, t2 (L-CCA) 127, t2 (G-CCA) 115, CPU time 1180\n\nURL features:\nid 1: krpcca 600, t2 (L-CCA) 7, t2 (G-CCA) 4, CPU time 220\nid 2: krpcca 600, t2 (L-CCA) 16, t2 (G-CCA) 11, CPU time 175\nid 3: krpcca 600, t2 (L-CCA) 17, t2 (G-CCA) 13, CPU time 130\n\n5.2 URL Features\n\nThe second dataset is the URL Reputation dataset from the UCI machine learning repository. The dataset contains 2.4 million URLs, each represented by 3.2 million features. For simplicity we only use the first 400k URLs. 38% of the features are host based features like WHOIS info and IP prefix, and 62% are lexical based features like hostname and primary domain. See [19] for detailed information about this dataset. Unfortunately the features are anonymous, so we pick the first 35% of the features as our X and the last 35% as our Y. We remove the 64 continuous features and only use the Boolean features. We sort the features according to their frequency (each feature is a column of 0s and 1s; the column with the most 1s is the most frequent feature). We run CCA on three different subsets of X and Y. 
In the \ufb01rst experiment we select the 20k most frequent features of X and Y respectively.\n\n7\n\n\fPTB Word Occurrence CPU time: 170 secs\n\nPTB Word Occurrence CPU time: 460 secs\n\nPTB Word Occurrence CPU time: 1180 secs\n\nl\n\nn\no\ni\nt\na\ne\nr\nr\no\nC\n\n1\n\n0.9\n\n0.8\n\n0.7\n\n0.6\n\n0.5\n\n0.4\n\n0.3\n\n \n\n \n\nL\u2212CCA\n\nD\u2212CCA\n\nRPCCA\n\nG\u2212CCA\n\n5\n\n10\nIndex\n\n15\n\n20\n\nl\n\nn\no\ni\nt\na\ne\nr\nr\no\nC\n\n1\n\n0.9\n\n0.8\n\n0.7\n\n0.6\n\n0.5\n\n0.4\n\n0.3\n\n \n\n \n\nL\u2212CCA\n\nD\u2212CCA\n\nRPCCA\n\nG\u2212CCA\n\n5\n\n10\nIndex\n\n15\n\n20\n\nl\n\nn\no\ni\nt\na\ne\nr\nr\no\nC\n\n1\n\n0.9\n\n0.8\n\n0.7\n\n0.6\n\n0.5\n\n0.4\n\n0.3\n\n \n\n \n\nL\u2212CCA\n\nD\u2212CCA\n\nRPCCA\n\nG\u2212CCA\n\n5\n\n10\nIndex\n\n15\n\n20\n\nFigure 1: PTB word co-ocurrence: Canonical correlations of the 20 directions returned by four\nalgorithms. x axis are the indices and y axis are the correlations.\n\nIn the second experiment we select 20k most frequent features from X Y after removing the top\n100 most frequent features of X and 200 most frequent features of Y. In the third experiment we\nremove top 200 most frequent features from X and top 400 most frequent features of Y. So we are\ndoing CCA between two 400k \u21e4 20k data matrices in these experiments. In this dataset the features\nwithin X and Y has huge correlations, so Cxx and Cyy aren\u2019t diagonal anymore. But we still run\nD-CCA since it\u2019s extremely fast. The parameter set ups for the three subsets are shown in table 1\nand the 20 correlations are shown in \ufb01gure 2.\nFor this dataset the fast D-CCA doesn\u2019t capture largest correlation since the correlation within X\nand Y make Cxx, Cyy not diagonal. RPCCA has best performance in experiment 1 but not as good\nin 2, 3. On the other hand G-CCA has good performance in experiment 3 but performs poorly in 1,\n2. 
The reason is as follows. In experiment 1 the data matrices are relatively dense since they include some frequent features, so every gradient iteration in L-CCA and G-CCA is slow. Moreover, since there are some high frequency features and most features have very low frequency, the spectrum of the data matrices in experiment 1 is very steep, which makes GD in every iteration of G-CCA converge very slowly. These lead to the poor performance of G-CCA. In experiment 3, since the frequent features are removed, the data matrices become sparser and have a flatter spectrum, which is in favor of G-CCA. L-CCA has stable and close to best performance despite these variations in the datasets.\n\nFigure 2: URL: canonical correlations of the 20 directions returned by the four algorithms (L-CCA, D-CCA, RPCCA, G-CCA) on the three subsets, with CPU times of 220, 175 and 130 secs. The x axis is the index and the y axis is the correlation.\n\n6 Conclusion and Future Work\n\nIn this paper we introduce L-CCA, a fast CCA algorithm for huge sparse data matrices. We construct theoretical bounds on the approximation error of L-CCA compared with the true CCA solution and run experiments on two real datasets in which L-CCA has favorable performance. 
On the other hand, there are many interesting fast LS algorithms with provable guarantees which can be plugged into the iterative LS formulation of CCA. Moreover, in the experiments we focus for simplicity on how much correlation is captured by L-CCA. It is also interesting to use L-CCA for feature generation and evaluate its performance on specific learning tasks.\n\nReferences\n\n[1] Marina A. Epelman. Rate of convergence of steepest descent algorithm. 2007.\n[2] Haim Avron, Christos Boutsidis, Sivan Toledo, and Anastasios Zouzias. Efficient dimensionality reduction for canonical correlation analysis. In ICML (1), pages 347-355, 2013.\n[3] Francis R. Bach and Michael I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical report, University of California, Berkeley, 2005.\n[4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177-187, Paris, France, August 2010. Springer.\n[5] Paramveer Dhillon, Yichao Lu, Dean P. Foster, and Lyle Ungar. New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems 26, pages 360-368, 2013.\n[6] Paramveer S. Dhillon, Dean Foster, and Lyle Ungar. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems (NIPS), volume 24, 2011.\n[7] Paramveer S. Dhillon, Jordan Rodu, Dean P. Foster, and Lyle H. Ungar. Two step CCA: A new spectral method for estimating vector models of words. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, 2012.\n[8] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. CoRR, abs/0710.1435, 2007.\n[9] Dean P. Foster, Sham M. 
Kakade, and Tong Zhang. Multi-view dimensionality reduction via canonical correlation analysis. Technical report, 2008.\n[10] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.\n[11] Gene H. Golub and Hongyuan Zha. The canonical correlations of matrix pairs and their numerical computation. Technical report, Computer Science Department, Stanford University, 1992.\n[12] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217-288, May 2011.\n[13] Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal component analysis of large data sets. SIAM J. Scientific Computing, 33(5):2580-2594, 2011.\n[14] David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Technical report, 2007.\n[15] H. Hotelling. Relations between two sets of variables. Biometrika, 28:312-377, 1936.\n[16] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems (NIPS), 2013.\n[17] Sham M. Kakade and Dean P. Foster. Multi-view regression via canonical correlation analysis. In Proc. of the Conference on Learning Theory, 2007.\n[18] Michael Lamar, Yariv Maron, Mark Johnson, and Elie Bienenstock. SVD and clustering for unsupervised POS tagging. In Proceedings of the ACL 2010 Conference Short Papers, pages 215-219, Uppsala, Sweden, 2010. Association for Computational Linguistics.\n[19] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying suspicious URLs: An application of large-scale online learning. In Proc. 
of the International Conference on Machine Learning (ICML), 2009.\n[20] Liang Sun, Shuiwang Ji, and Jieping Ye. A least squares formulation for canonical correlation analysis. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 1024-1031, New York, NY, USA, 2008. ACM.\n[21] Lloyd N. Trefethen and David Bau. Numerical Linear Algebra. SIAM, 1997.\n", "award": [], "sourceid": 91, "authors": [{"given_name": "Yichao", "family_name": "Lu", "institution": "University of Pennsylvania"}, {"given_name": "Dean", "family_name": "Foster", "institution": "University of Pennsylvania"}]}