{"title": "Hierarchical Eigensolver for Transition Matrices in Spectral Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 273, "page_last": 280, "abstract": null, "full_text": "Hierarchical Eigensolver for Transition Matrices\n\nin Spectral Methods\n\nChakra Chennubhotla(cid:3) and Allan D. Jepsony\n\n(cid:3)Department of Computational Biology, University of Pittsburgh\n\nyDepartment of Computer Science, University of Toronto\n\nAbstract\n\nWe show how to build hierarchical, reduced-rank representation for large\nstochastic matrices and use this representation to design an ef\ufb01cient al-\ngorithm for computing the largest eigenvalues, and the corresponding\neigenvectors. In particular, the eigen problem is \ufb01rst solved at the coars-\nest level of the representation. The approximate eigen solution is then\ninterpolated over successive levels of the hierarchy. A small number of\npower iterations are employed at each stage to correct the eigen solution.\nThe typical speedups obtained by a Matlab implementation of our fast\neigensolver over a standard sparse matrix eigensolver [13] are at least a\nfactor of ten for large image sizes. The hierarchical representation has\nproven to be effective in a min-cut based segmentation algorithm that we\nproposed recently [8].\n\n1 Spectral Methods\nGraph-theoretic spectral methods have gained popularity in a variety of application do-\nmains: segmenting images [22]; embedding in low-dimensional spaces [4, 5, 8]; and clus-\ntering parallel scienti\ufb01c computation tasks [19]. Spectral methods enable the study of prop-\nerties global to a dataset, using only local (pairwise) similarity or af\ufb01nity measurements be-\ntween the data points. The global properties that emerge are best understood in terms of a\nrandom walk formulation on the graph. 
For example, the graph can be partitioned into clusters by analyzing the perturbations to the stationary distribution of a Markovian relaxation process defined in terms of the affinity weights [17, 18, 24, 7]. The Markovian relaxation process need never be explicitly carried out; instead, it can be analytically expressed using the leading-order eigenvectors and eigenvalues of the Markov transition matrix.

In this paper we consider the practical application of spectral methods to large datasets. In particular, the eigendecomposition can be very expensive, on the order of O(n^3), where n is the number of nodes in the graph. While it is possible to compute the first eigenvector analytically (see §3 below), the remaining subspace of vectors (necessary, say, for clustering) has to be explicitly computed. A typical approach to dealing with this difficulty is to first sparsify the links in the graph [22] and then apply an efficient eigensolver [13, 23, 3].

In comparison, we propose in this paper a specialized eigensolver suitable for large stochastic matrices with known stationary distributions. In particular, we exploit the spectral properties of the Markov transition matrix to generate hierarchical, successively lower-ranked approximations to the full transition matrix. The eigenproblem is solved directly at the coarsest level of representation. The approximate eigensolution is then interpolated over successive levels of the hierarchy, using a small number of power iterations to correct the solution at each stage.

2 Previous Work

One approach to speeding up the eigendecomposition is to use the fact that the columns of the affinity matrix are typically correlated. The idea then is to pick a small number of representative columns to perform the eigendecomposition via SVD. 
For example, in the Nyström approximation procedure, originally proposed for integral eigenvalue problems, the idea is to randomly pick a small set of m columns, generate the corresponding affinity matrix, solve the eigenproblem, and finally extend the solution to the complete graph [9, 10]. The Nyström method has also recently been applied in kernel learning methods for fast Gaussian process classification and regression [25]. Other sampling-based approaches include the work reported in [1, 2, 11].

Our starting point is the transition matrix generated from affinity weights, and we show how building a representational hierarchy follows naturally from considering the stochastic matrix. A closely related work is the paper by Lin on reduced-rank approximations of transition matrices [14]. We differ in how we approximate the transition matrices; in particular, our objective function is computationally less expensive to solve. One of our goals in reducing transition matrices is to develop a fast, specialized eigensolver for spectral clustering. Fast eigensolving is also the goal in ACE [12], where successive levels in the hierarchy can potentially have negative affinities. A graph coarsening process for clustering was also pursued in [21, 3].

3 Markov Chain Terminology

We first provide a brief overview of the Markov chain terminology here (for more details see [17, 15, 6]). We consider an undirected graph G = (V, E) with vertices v_i, for i ∈ {1, ..., n}, and edges e_{i,j} with non-negative weights a_{i,j}. Here the weight a_{i,j} represents the affinity between vertices v_i and v_j. The affinities are represented by a non-negative, symmetric n × m matrix A having weights a_{i,j} as elements. 
The degree of a node j is defined to be d_j = Σ_{i=1}^n a_{i,j} = Σ_{i=1}^n a_{j,i}, where we define D = diag(d_1, ..., d_n). A Markov chain is defined using these affinities by setting a transition probability matrix M = A D^{-1}, where the columns of M each sum to 1. The transition probability matrix defines the random walk of a particle on the graph G.

The random walk need never be explicitly carried out; instead, it can be analytically expressed using the leading-order eigenvectors and eigenvalues of the Markov transition matrix. Because the stochastic matrices need not be symmetric in general, a direct eigendecomposition step is not preferred, for reasons of instability. This problem is easily circumvented by considering a normalized affinity matrix: L = D^{-1/2} A D^{-1/2}, which is related to the stochastic matrix by a similarity transformation: L = D^{-1/2} M D^{1/2}. Because L is symmetric, it can be diagonalized: L = U Λ U^T, where U = [u_1, u_2, ..., u_n] is an orthogonal set of eigenvectors and Λ is a diagonal matrix of eigenvalues [λ_1, λ_2, ..., λ_n] sorted in decreasing order. The eigenvectors have unit length ||u_k|| = 1, and from the form of A and D it can be shown that the eigenvalues λ_i ∈ (-1, 1], with at least one eigenvalue equal to one. Without loss of generality, we take λ_1 = 1. Because L and M are similar, we can perform an eigendecomposition of the Markov transition matrix as: M = D^{1/2} L D^{-1/2} = D^{1/2} U Λ U^T D^{-1/2}. Thus an eigenvector u of L corresponds to an eigenvector D^{1/2} u of M with the same eigenvalue λ.

The Markovian relaxation process after β iterations, namely M^β, can be represented as: M^β = D^{1/2} U Λ^β U^T D^{-1/2}. 
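The construction above can be sketched in a few lines of NumPy (a minimal illustration, not the paper's Matlab code; the 3-node affinity matrix is an invented toy example):

```python
import numpy as np

# Constructing M = A D^{-1} and L = D^{-1/2} A D^{-1/2} from affinities,
# and checking the stated eigenvector correspondence.
A = np.array([[0., 1., 0.],
              [1., 0., 2.],
              [0., 2., 0.]])               # affinities of a 3-node path graph
d = A.sum(axis=0)                          # degrees d_j
M = A / d                                  # column j of A divided by d_j
L = A / np.sqrt(np.outer(d, d))            # D^{-1/2} A D^{-1/2}

assert np.allclose(M.sum(axis=0), 1.0)     # columns of M sum to 1
lam, U = np.linalg.eigh(L)                 # eigenvalues in increasing order
assert np.isclose(lam[-1], 1.0)            # lambda_1 = 1 for a connected graph
v = np.sqrt(d) * U[:, -1]                  # D^{1/2} u is an eigenvector of M
assert np.allclose(M @ v, lam[-1] * v)     # with the same eigenvalue
```

The final assertion verifies the similarity relation M = D^{1/2} L D^{-1/2} numerically.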
Therefore, a particle undertaking a random walk with an initial distribution p^0 acquires after β steps a distribution p^β given by p^β = M^β p^0. Assuming the graph is connected, as β → ∞ the Markov chain approaches a unique stationary distribution, given by π = diag(D)/Σ_{i=1}^n d_i, and thus M^∞ = π 1^T, where 1 is an n-dimensional column vector of all ones. Observe that π is an eigenvector of M, as it is easy to show that Mπ = π, and the corresponding eigenvalue is 1. Next, we show how to generate hierarchical, successively lower-ranked approximations to the transition matrix M.

4 Building a Hierarchy of Transition Matrices

The goal is to generate a very fast approximation while simultaneously achieving sufficient accuracy. For notational ease, we think of M as a fine-scale representation and M̃ as some coarse-scale approximation to be derived here. By coarsening M̃ further, we can generate successive levels of the representation hierarchy. We use the stationary distribution π to construct a corresponding coarse-scale stationary distribution δ. As we just discussed, a critical property of the fine-scale Markov matrix M is that it is similar to the symmetric matrix L, and we wish to preserve this property at every level of the representation hierarchy.

4.1 Deriving the Coarse-Scale Stationary Distribution

We begin by expressing the stationary distribution π as a probabilistic mixture of latent distributions. In matrix notation, we have

π = K δ,    (1)

where δ is an unknown mixture coefficient vector of length m, and K is an n × m non-negative kernel matrix whose columns are latent distributions that each sum to 1: Σ_{i=1}^n K_{i,j} = 1 and m ≪ n. It is easy to derive a maximum-likelihood approximation of δ using an EM-type algorithm [16]. 
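The mixture-weight estimate implied by this model can be sketched as follows (a sketch of the standard EM update for the weights only, with K held fixed; the paper's procedure also updates K, and the function name is ours):

```python
import numpy as np

def em_delta(K, pi, iters=5000):
    # EM-style fit of pi ≈ K delta (Eq. 1), updating only the weights delta.
    n, m = K.shape
    delta = np.full(m, 1.0 / m)              # uniform initialization
    for _ in range(iters):
        R = K * delta                        # ownership r_ij ∝ delta_j K_ij
        R /= R.sum(axis=1, keepdims=True)
        delta = R.T @ pi                     # delta_j = sum_i pi_i r_ij
    return delta

rng = np.random.default_rng(0)
K = rng.random((50, 4)); K /= K.sum(axis=0)  # columns are distributions
pi = K @ np.array([0.1, 0.2, 0.3, 0.4])      # pi generated by a known mixture
delta = em_delta(K, pi)
assert np.allclose(delta.sum(), 1.0)         # weights stay normalized
assert np.abs(K @ delta - pi).max() < 1e-2   # the mixture fit recovers pi
```

Because the log-likelihood is concave in delta, this iteration converges to the global optimum when pi lies in the column span of K.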
The main step is to find a stationary point (δ, λ) of the Lagrangian:

E ≡ - Σ_{i=1}^n π_i ln Σ_{j=1}^m K_{i,j} δ_j + λ ( Σ_{j=1}^m δ_j - 1 ).    (2)

An implicit step in this EM procedure is to compute the ownership probability r_{i,j} of the jth kernel (or node) at the coarse scale for the ith node on the fine scale, given by

r_{i,j} = δ_j K_{i,j} / Σ_{k=1}^m δ_k K_{i,k}.    (3)

The EM procedure allows for an update of both δ and the latent distributions in the kernel matrix K (see §8.3.1 in [6]). For initialization, δ is taken to be uniform over the coarse-scale states. But in choosing the kernels K, we provide a good initialization for the EM procedure. Specifically, the Markov matrix M is diffused using a small number of iterations to get M^β. The diffusion causes random walks from neighboring nodes to be less distinguishable. This in turn helps us select a small number of columns of M^β in a fast and greedy way to be the kernel matrix K. We defer the exact details on kernel selection to a later section (§4.3).

4.2 Deriving the Coarse-Scale Transition Matrix

In order to define M̃, the coarse-scale transition matrix, we break the derivation down into three steps. First, the Markov chain propagation at the coarse scale can be defined as:

q^{k+1} = M̃ q^k,    (4)

where q^k is the coarse-scale probability distribution after k steps of the random walk. Second, we expand q^k into the fine scale using the kernels K, resulting in a fine-scale probability distribution p^k:

p^k = K q^k.    (5)

Finally, we lift p^k back into the coarse scale by using the ownership probability of the jth kernel for the ith node on the fine grid:

q_j^{k+1} = Σ_{i=1}^n r_{i,j} p_i^k.    (6)

Substituting Eqs. (3) and (5) into Eq. (6) gives

q_j^{k+1} = Σ_{i=1}^n r_{i,j} Σ_{t=1}^m K_{i,t} q_t^k = Σ_{i=1}^n ( δ_j K_{i,j} / Σ_{k=1}^m δ_k K_{i,k} ) Σ_{t=1}^m K_{i,t} q_t^k.    (7)

We can write the preceding equation in matrix form:

q^{k+1} = diag(δ) K^T diag(K δ)^{-1} K q^k.    (8)

Comparing this with Eq. (4), we can derive the transition matrix M̃ as:

M̃ = diag(δ) K^T diag(K δ)^{-1} K.    (9)

It is easy to see that δ = M̃ δ, so δ is the stationary distribution for M̃. Following the definition of M̃ and its stationary distribution δ, we can generate a symmetric coarse-scale affinity matrix Ã, given by

Ã = M̃ diag(δ) = ( diag(δ) K^T ) ( diag(K δ)^{-1} ) ( K diag(δ) ),    (10)

where we substitute the expression for M̃ from Eq. (9). The coarse-scale affinity matrix Ã is then normalized to get:

L̃ = D̃^{-1/2} Ã D̃^{-1/2},  D̃ = diag(d̃_1, d̃_2, ..., d̃_m),    (11)

where d̃_j is the degree of node j in the coarse-scale graph represented by the matrix Ã (see §3 for the degree definition). Thus, the coarse-scale Markov matrix M̃ is precisely similar to a symmetric matrix L̃.

4.3 Selecting Kernels

For demonstration purposes, we present the kernel selection details on the image of an eye shown below. To begin with, a random walk is defined where each pixel in the test image is associated with a vertex of the graph G. The edges in G are defined by the standard 8-neighbourhood of each pixel. 
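The coarse-scale quantities in Eqs. (9) and (10) are cheap to compute and easy to sanity-check numerically; a minimal sketch (variable names are ours, inputs are random stand-ins):

```python
import numpy as np

def coarse_transition(K, delta):
    # M~ = diag(delta) K^T diag(K delta)^{-1} K  (Eq. 9)
    # A~ = M~ diag(delta)                        (Eq. 10)
    w = K @ delta                            # fine-scale marginal K delta
    M_c = (delta[:, None] * K.T / w) @ K     # entry [j,t] = delta_j sum_i K_ij K_it / w_i
    A_c = M_c * delta                        # multiply column t by delta_t
    return M_c, A_c

rng = np.random.default_rng(1)
K = rng.random((30, 5)); K /= K.sum(axis=0)  # kernel columns sum to 1
delta = rng.random(5); delta /= delta.sum()

M_c, A_c = coarse_transition(K, delta)
assert np.allclose(M_c @ delta, delta)       # delta is stationary for M~
assert np.allclose(A_c, A_c.T)               # A~ is symmetric
assert np.allclose(M_c.sum(axis=0), 1.0)     # M~ is column-stochastic
```

The three assertions check exactly the properties the derivation claims: M̃δ = δ (since K^T 1 = 1), symmetry of Ã, and stochasticity of M̃.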
For the demonstrations in this paper, the edge weight a_{i,j} between neighbouring pixels x_i and x_j is given by a function of the difference in the corresponding intensities I(x_i) and I(x_j): a_{i,j} = exp(-(I(x_i) - I(x_j))^2 / 2σ_a^2), where σ_a is set according to the median absolute difference |I(x_i) - I(x_j)| between neighbours, measured over the entire image. The affinity matrix A with these edge weights is then used to generate a Markov transition matrix M.

The kernel selection process we use is fast and greedy. First, the fine-scale Markov matrix M is diffused to M^β using β = 4. The Markov matrix M is sparse, as we make the affinity matrix A sparse. Every column in the diffused matrix M^β is a potential kernel. To facilitate the selection process, the second step is to rank-order the columns of M^β based on their probability value in the stationary distribution π. Third, the kernels (i.e. columns of M^β) are picked in such a way that, for a kernel K_i, all of the neighbours of pixel i which are within the half-height of the maximum value in the kernel K_i are suppressed from the selection process. Finally, the kernel selection is continued until every pixel in the image is within a half-height of the peak value of at least one kernel. If M is a full matrix, to avoid the expense of computing M^β explicitly, random kernel centers can be selected, and only the corresponding columns of M^β need be computed.

We show results from a three-scale hierarchy on the eye image (below). The image has 25 × 20 pixels but is shown here enlarged for clarity. At the first coarse scale 83 kernels are picked. 
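The greedy selection loop just described can be sketched as follows (an illustrative sketch on a 10-node chain graph rather than an image; function and variable names are ours):

```python
import numpy as np

def select_kernels(M_beta, pi):
    # Rank candidate centers by stationary probability, then greedily pick
    # columns of M^beta, suppressing nodes already within the half-height
    # support of a chosen kernel, until every node is covered.
    n = M_beta.shape[0]
    order = np.argsort(-pi)                  # most probable centers first
    covered = np.zeros(n, dtype=bool)
    centers = []
    for i in order:
        if covered[i]:
            continue                         # suppressed by an earlier kernel
        k = M_beta[:, i]
        centers.append(i)
        covered |= k >= 0.5 * k.max()        # nodes within the half-height
        if covered.all():
            break
    return centers

A = np.eye(10, k=1) + np.eye(10, k=-1)       # chain-graph affinities
M = A / A.sum(axis=0)                        # column-stochastic M
M_beta = np.linalg.matrix_power(M, 4)        # diffuse with beta = 4
pi = A.sum(axis=0) / A.sum()                 # stationary distribution
centers = select_kernels(M_beta, pi)
assert 0 < len(centers) < 10                 # far fewer kernels than nodes
```

On an image the same loop runs over pixel columns of the sparse M^β; only the half-height suppression rule and the π-based ranking matter.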
The kernels each correspond to a different column in the fine-scale transition matrix, and the pixels giving rise to these kernels are shown numbered on the image.

[Figure: the eye image with the selected kernel centers shown numbered; panels labeled Coarse Scale 1 and Coarse Scale 2.]

Using these kernels as an initialization, the EM procedure derives a coarse-scale stationary distribution δ (Eq. 2), while simultaneously updating the kernel matrix. Using the newly updated kernel matrix K and the derived stationary distribution δ, a transition matrix M̃ is generated (Eq. 9). The coarse-scale Markov matrix is then diffused to M̃^β, again using β = 4. The kernel selection algorithm is reapplied, this time picking 32 kernels for the second coarse scale. Larger values of β cause the coarser level to have fewer elements, but the exact number of elements depends on the form of the kernels themselves. For the random experiments that we describe later in §6, we found that β = 2 in the first iteration and β = 4 thereafter causes the number of kernels to be reduced by a factor of roughly 1/3 to 1/4 at each level.

At coarser levels of the hierarchy, we expect the kernels to get less sparse, and so will the affinity and transition matrices. In order to promote sparsity at successive levels of the hierarchy, we sparsify Ã by zeroing out elements associated with “small” transition probabilities in M̃. However, in the experiments described later in §6, we observe this sparsification step to be not critical.

To summarize, we use the stationary distribution π at the fine scale to derive a transition matrix M̃, and its stationary distribution δ, at the coarse scale. The coarse-scale transition matrix in turn helps to derive an affinity matrix Ã and its normalized version L̃. It is obvious that this procedure can be repeated recursively. We describe next how to use this representation hierarchy for building a fast eigensolver.

5 Fast EigenSolver

Our goal in generating a hierarchical representation of a transition matrix is to develop a fast, specialized eigensolver for spectral clustering. To this end, we perform a full eigendecomposition of the normalized affinity matrix only at the coarsest level. As discussed in the previous section, the affinity matrix at the coarsest level is not likely to be sparse, hence it will need a full (as opposed to a sparse) eigensolver. However, it is typically the case that e ≤ m ≪ n (even in the case of the three-scale hierarchy that we just considered) and hence we expect this step to be the least expensive computationally. The resulting eigenvectors are interpolated to the next lower level of the hierarchy by a process which will be described next. 
Because the eigen interpolation process between every adjacent pair of scales in the hierarchy is similar, we will assume we have access to the leading eigenvectors Ũ (size: m × e) for the normalized affinity matrix L̃ (size: m × m) and describe how to generate the leading eigenvectors U (size: n × e) and the leading eigenvalues S (size: e × 1) for the fine-scale normalized affinity matrix L (size: n × n). There are several steps to the eigen interpolation process, and in the discussion that follows we refer to the lines in the pseudo-code presented below.

First, the coarse-scale eigenvectors Ũ can be interpolated using the kernel matrix K to generate U = KŨ, an approximation for the fine-scale eigenvectors (line 9). Second, interpolation alone is unlikely to set the directions of U exactly aligned with U_L, the vectors one would obtain by a direct eigendecomposition of the fine-scale normalized affinity matrix L. We therefore update the directions in U by applying a small number of power iterations with L, as given in lines 13-15.

function (U, S) = CoarseToFine(L, K, Ũ, S̃)
1: INPUT
2:   L, K  {L is n × n and K is n × m, where m ≪ n}
3:   Ũ, S̃  {leading coarse-scale eigenvectors/eigenvalues of L̃. Ũ is of size m × e, e ≤ m}
4: OUTPUT
5:   U, S  {leading fine-scale eigenvectors/eigenvalues of L. U is n × e and S is e × 1}
6: CONSTANTS:
7:   TOL = 1e-4; POWER_ITERS = 50
8: TPI = min(POWER_ITERS, log(e × eps/TOL)/log(min(S̃)))  {eps: machine accuracy}
9: U = KŨ  {interpolation from coarse to fine}
10: while not converged do
11:   Uold = U  {n × e matrix, e ≪ n}
12:   {correct the directions by power iterations:}
13:   for i = 1 to TPI do
14:     U ← LU
15:   end for
16:   U ← Gram-Schmidt(U)  {orthogonalize U}
17:   Le = U^T L U  {L may be sparse, but Le need not be}
18:   (Ue, Se) = svd(Le)  {eigenanalysis of Le, which is of size e × e}
19:   U ← U Ue  {update the leading eigenvectors of L}
20:   S = diag(Se)  {grab the leading eigenvalues of L}
21:   innerProd = 1 - diag(Uold^T U)  {1 is an e × 1 vector of all ones}
22:   converged = max[abs(innerProd)] < TOL
23: end while

Figure 1: Hierarchical eigensolver results. (a) comparing ground-truth eigenvalues S_L (red circles) with the multi-scale eigensolver spectrum S (blue line); (b) relative absolute error between eigenvalues, |S - S_L|/S_L; (c) eigenvector mismatch, 1 - diag(|U^T U_L|), between eigenvectors U derived by the multi-scale eigensolver and the ground truth U_L. Observe the slight mismatch in the last few eigenvectors, but excellent agreement in the leading eigenvectors (see text).

The number of power iterations TPI can be bounded as discussed next. 
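The procedure in the pseudo-code above can be sketched in NumPy (a sketch only; the paper's implementation is Matlab, and the sign-robust convergence test and eigenvalue clipping below are our additions):

```python
import numpy as np

def coarse_to_fine(L, K, U_c, S_c, tol=1e-4, power_iters=50, max_outer=200):
    e = U_c.shape[1]
    eps = np.finfo(float).eps
    s_min = np.clip(S_c.min(), 1e-12, 1.0 - 1e-6)   # guard the log arguments
    tpi = max(1, int(min(power_iters, np.log(e * eps / tol) / np.log(s_min))))
    U = K @ U_c                              # line 9: interpolate coarse -> fine
    S = None
    for _ in range(max_outer):
        U_old = U
        for _ in range(tpi):                 # power iterations with L
            U = L @ U
        U, _ = np.linalg.qr(U)               # orthogonalize (Gram-Schmidt)
        Le = U.T @ L @ U                     # reduced e x e problem
        S, Ue = np.linalg.eigh(Le)           # Rayleigh-Ritz on the subspace
        S, Ue = S[::-1], Ue[:, ::-1]         # sort eigenvalues in decreasing order
        U = U @ Ue
        # Directions stop changing (|.| guards against harmless sign flips).
        if np.max(np.abs(1.0 - np.abs(np.diag(U_old.T @ U)))) < tol:
            break
    return U, S

# Synthetic check: L with a known spectrum; a random non-negative K plays
# the role of the kernel matrix, and K^T L K supplies the coarse eigenpairs.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
vals = np.linspace(1.0, 0.01, 30)
L = Q @ np.diag(vals) @ Q.T
K = rng.random((30, 6)); K /= K.sum(axis=0)
s_c, u_c = np.linalg.eigh(K.T @ L @ K)
U, S = coarse_to_fine(L, K, u_c[:, -3:], s_c[-3:], tol=1e-6)
assert np.allclose(S, vals[:3], atol=1e-3)   # leading eigenvalues recovered
```

At convergence the Ritz update makes U an (approximately) invariant subspace of L, so the returned S matches the leading eigenvalues regardless of how crude the coarse initialization was, provided the initial subspace is not orthogonal to the target eigenvectors.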
Suppose v = Uc, where U is a matrix of true eigenvectors and c is a coefficient vector for an arbitrary vector v. After TPI power iterations, v becomes v = U diag(S^TPI) c, where S has the exact eigenvalues. In order for the component of a vector v in the direction u_e (the eth column of U) not to be swamped by the other components, we can limit its decay after TPI iterations as follows: (S(e)/S(1))^TPI ≥ e × eps/TOL, where S(e) is the exact eth eigenvalue, S(1) = 1, eps is the machine precision, and TOL is the requested accuracy. Because we do not have access to the exact value S(e) at the beginning of the interpolation procedure, we estimate it from the coarse eigenvalues S̃. This leads to a bound on the power iterations TPI, as derived on line 8 above. Third, the interpolation process and the power iterations need not preserve orthogonality of the eigenvectors in U. We fix this by a Gram-Schmidt orthogonalization procedure (line 16). Finally, there is still a problem with the power iterations that needs to be resolved: it is very hard to separate nearby eigenvalues. In particular, for the convergence of the power iterations the ratio that matters is between the (e+1)st and eth eigenvalues. So the idea we pursue is to use the power iterations only to separate the reduced space of eigenvectors (of dimension e) from the orthogonal subspace (of dimension n - e). We then use a full SVD on the reduced space to update the leading eigenvectors U and eigenvalues S for the fine scale (lines 17-20). This idea is similar to computing the Ritz values and Ritz vectors in a Rayleigh-Ritz method.

6 Interpolation Results

Our multi-scale decomposition code is in Matlab. For the direct eigendecomposition, we have used the Matlab program svds.m, which invokes the compiled ARPACKC routine [13], with a default convergence tolerance of 1e-10. In Fig. 
1a we compare the spectrum S obtained from a three-scale decomposition on the eye image (blue line) with the ground truth, which is the spectrum S_L resulting from a direct eigendecomposition of the fine-scale normalized affinity matrix L (red circles). There is excellent agreement in the leading eigenvalues. To illustrate this, we show the absolute relative error between the spectra, |S - S_L|/S_L, in Fig. 1b. The spectra agree mostly, except for the last few eigenvalues. For a quantitative comparison between the eigenvectors, we plot in Fig. 1c the following measure: 1 - diag(|U^T U_L|), where U is the matrix of eigenvectors obtained by the multi-scale approximation, U_L is the ground truth resulting from a direct eigendecomposition of the fine-scale affinity matrix L, and 1 is a vector of all ones. The relative error plot demonstrates a close match, within the tolerance threshold of 1e-4 that we chose for the multi-scale method, in the leading eigenvector directions between the two methods. The relative error is high for the last few eigenvectors, which suggests that the power iterations have not clearly separated them from the other directions. So the strategy we suggest is to pad the required number of leading eigenbasis vectors by about 20% before invoking the multi-scale procedure. Obviously, the number of hierarchical stages for the multi-scale procedure must be chosen such that the transition matrix at the coarsest scale can accommodate the slight increase in subspace dimensions. For lack of space we omit extra results (see Ch. 8 in [6]).

Next we measure the time the hierarchical eigensolver takes to compute the leading eigenbasis for various input sizes, in comparison with the svds.m procedure [13]. We form images of different input sizes by Gaussian smoothing of i.i.d. noise. The Gaussian function has a standard deviation of 3 pixels. 
The edges in graph G are defined by the standard 8-neighbourhood of each pixel. The edge weights between neighbouring pixels are simply given by a function of the difference in the corresponding intensities (see §4.3). The affinity matrix A with the edge weights is then used to generate a Markov transition matrix M. The fast eigensolver is run on ten different instances of the input image of a given size, and the average of these times is reported here. For a fair comparison between the two procedures, we set the convergence tolerance value for the svds.m procedure to 1e-4, the same as the one used for the fast eigensolver. We found the hierarchical representation derived from this tolerance threshold to be sufficiently accurate for the novel min-cut based segmentation results that we reported in [8]. Also, the subspace dimensionality is fixed at 51, where we expect (and indeed observe) the leading 40 eigenpairs derived from the multi-scale procedure to be accurate. Hence, while invoking svds.m we compute only the leading 41 eigenpairs.

In the table shown below, the first column corresponds to the number of nodes in the graph, while the second and third columns report the time taken in seconds by the svds.m procedure and the Matlab implementation of the multi-scale eigensolver, respectively. The fourth column reports the speedups of the multi-scale eigensolver over the svds.m procedure on a standard desktop (Intel P4, 2.5 GHz, 1 GB RAM). Lowering the tolerance threshold for svds.m made it faster by about 20-30%. Despite this, the multi-scale algorithm clearly outperforms the svds.m procedure. The most expensive step in the multi-scale algorithm is the power iteration required in the last stage, that is, interpolating eigenvectors from the first coarse scale to the required fine scale. The complexity is of the order of n × e, where e is the subspace dimensionality and n is the size of the graph. 
Indeed, from the table below we can see that the multi-scale procedure takes time roughly proportional to n. Deviations from the linear trend are observed at specific values of n, which we believe are due to variations in the difficulty of the specific eigenvalue problem (e.g. nearly multiple eigenvalues).

n          svds.m    Multi-Scale   Speedup
32^2          1.6          1.5        1.1
63^2         10.8          4.9        2.2
64^2         20.5          5.5        3.7
65^2         12.6          5.1        2.5
100^2        44.2         13.1        3.4
127^2        91.1         20.4        4.5
128^2       230.9         35.2        6.6
129^2        96.9         20.9        4.6
160^2       179.3         34.4        5.2
255^2       819.2         90.3        9.1
256^2      2170.8        188.7       11.5
257^2       871.7         93.3        9.3
511^2      7977.2        458.8       17.4
512^2       20269        739.3       27.4
513^2      7887.2        461.9       17.1
600^2     10841.4        644.2       16.8
700^2     15048.8       1162.4       12.9
800^2           -       1936.6          -

The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. There we explored the use of random walks and associated spectral embedding techniques for the automatic generation of suitable proposal (source and sink) regions for a min-cut based algorithm. The multi-scale algorithm was used to generate the 40 leading eigenvectors of large transition matrices (e.g. size 20K × 20K). In terms of future work, it will be useful to compare our work with other approximate methods for SVD, such as [23].

Ack: We thank S. Roweis, F. Estrada and M. Sakr for valuable comments.

References

[1] D. Achlioptas and F. McSherry. Fast Computation of Low-Rank Approximations. STOC, 2001.
[2] D. Achlioptas et al. Sampling Techniques for Kernel Methods. NIPS, 2001.
[3] S. Barnard and H. Simon. Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems. PPSC, 627-632.
[4] M. Belkin et al. Laplacian Eigenmaps and Spectral Techniques for Embedding. NIPS, 2001.
[5] M. Brand et al. A unifying theorem for spectral embedding and clustering. 
AI & STATS, 2002.
[6] C. Chennubhotla. Spectral Methods for Multi-scale Feature Extraction and Spectral Clustering. Ph.D. Thesis, Department of Computer Science, University of Toronto, Canada, 2004. http://www.cs.toronto.edu/~chakra/thesis.pdf
[7] C. Chennubhotla and A. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS, 2002.
[8] F. Estrada, A. Jepson and C. Chennubhotla. Spectral Embedding and Min-Cut for Image Segmentation. Manuscript Under Review, 2004.
[9] C. Fowlkes et al. Efficient spatiotemporal grouping using the Nystrom method. CVPR, 2001.
[10] S. Belongie et al. Spectral Partitioning with Indefinite Kernels using the Nystrom approximation. ECCV, 2002.
[11] A. Frieze et al. Fast Monte-Carlo Algorithms for finding low-rank approximations. FOCS, 1998.
[12] Y. Koren et al. ACE: A Fast Multiscale Eigenvectors Computation for Drawing Huge Graphs. IEEE Symp. on InfoVis, 2002, pp. 137-144.
[13] R. B. Lehoucq, D. C. Sorensen and C. Yang. ARPACK User Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods. SIAM, 1998.
[14] J. J. Lin. Reduced Rank Approximations of Transition Matrices. AI & STATS, 2002.
[15] L. Lovász. Random Walks on Graphs: A Survey. Combinatorics, 1996, 353-398.
[16] G. J. McLachlan et al. Mixture Models: Inference and Applications to Clustering. 1988.
[17] M. Meila and J. Shi. A random walks view of spectral segmentation. AI & STATS, 2001.
[18] A. Ng, M. Jordan and Y. Weiss. On Spectral Clustering: analysis and an algorithm. NIPS, 2001.
[19] A. Pothen. Graph partitioning algorithms with applications to scientific computing. Parallel Numerical Algorithms, D. E. Keyes et al. (eds.), Kluwer Academic Press, 1996.
[20] G. L. Scott et al. Feature grouping by relocalization of eigenvectors of the proximity matrix. BMVC, pp. 103-108, 1990.
[21] E. Sharon et al. Fast Multiscale Image Segmentation. CVPR, I:70-77, 2000.
[22] J. Shi and J. Malik. 
Normalized cuts and image segmentation. PAMI, August 2000.
[23] H. Simon et al. Low-Rank Matrix Approximation Using the Lanczos Bidiagonalization Process with Applications. SIAM J. of Sci. Comp., 21(6):2257-2274, 2000.
[24] N. Tishby et al. Data clustering by Markovian Relaxation. NIPS, 2001.
[25] C. Williams et al. Using the Nystrom method to speed up kernel machines. NIPS, 2001.
", "award": [], "sourceid": 2665, "authors": [{"given_name": "Chakra", "family_name": "Chennubhotla", "institution": null}, {"given_name": "Allan", "family_name": "Jepson", "institution": null}]}