Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Chakra Chennubhotla, Allan Jepson*

We show how to build hierarchical, reduced-rank representation for large stochastic matrices and use this representation to design an efﬁcient al- gorithm for computing the largest eigenvalues, and the corresponding eigenvectors. In particular, the eigen problem is ﬁrst solved at the coars- est level of the representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy. A small number of power iterations are employed at each stage to correct the eigen solution. The typical speedups obtained by a Matlab implementation of our fast eigensolver over a standard sparse matrix eigensolver [13] are at least a factor of ten for large image sizes. The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8].

1 Spectral Methods Graph-theoretic spectral methods have gained popularity in a variety of application do- mains: segmenting images [22]; embedding in low-dimensional spaces [4, 5, 8]; and clus- tering parallel scientiﬁc computation tasks [19]. Spectral methods enable the study of prop- erties global to a dataset, using only local (pairwise) similarity or afﬁnity measurements be- tween the data points. The global properties that emerge are best understood in terms of a random walk formulation on the graph. For example, the graph can be partitioned into clus- ters by analyzing the perturbations to the stationary distribution of a Markovian relaxation process deﬁned in terms of the afﬁnity weights [17, 18, 24, 7]. The Markovian relaxation process need never be explicitly carried out; instead, it can be analytically expressed using the leading order eigenvectors, and eigenvalues, of the Markov transition matrix. In this paper we consider the practical application of spectral methods to large datasets. In particular, the eigen decomposition can be very expensive, on the order of O(n3), where n is the number of nodes in the graph. While it is possible to compute analytically the ﬁrst eigenvector (see x3 below), the remaining subspace of vectors (necessary for say clustering) has to be explicitly computed. A typical approach to dealing with this difﬁculty is to ﬁrst sparsify the links in the graph [22] and then apply an efﬁcient eigensolver [13, 23, 3]. In comparison, we propose in this paper a specialized eigensolver suitable for large stochas- tic matrices with known stationary distributions. In particular, we exploit the spectral prop- erties of the Markov transition matrix to generate hierarchical, successively lower-ranked approximations to the full transition matrix. The eigen problem is solved directly at the coarsest level of representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy, using a small number of power iterations to correct the solution at each stage.

2 Previous Work One approach to speeding up the eigen decomposition is to use the fact that the columns of the afﬁnity matrix are typically correlated. The idea then is to pick a small number of representative columns to perform eigen decomposition via SVD. For example, in the Nystrom approximation procedure, originally proposed for integral eigenvalue problems, the idea is to randomly pick a small set of m columns; generate the corresponding afﬁnity matrix; solve the eigenproblem and ﬁnally extend the solution to the complete graph [9, 10]. The Nystrom method has also been recently applied in the kernel learning methods for fast Gaussian process classiﬁcation and regression [25]. Other sampling-based approaches include the work reported in [1, 2, 11]. Our starting point is the transition matrix generated from afﬁnity weights and we show how building a representational hierarchy follows naturally from considering the stochas- tic matrix. A closely related work is the paper by Lin on reduced rank approximations of transition matrices [14]. We differ in how we approximate the transition matrices, in par- ticular our objective function is computationally less expensive to solve. In particular, one of our goals in reducing transition matrices is to develop a fast, specialized eigen solver for spectral clustering. Fast eigensolving is also the goal in ACE [12], where successive levels in the hierarchy can potentially have negative afﬁnities. A graph coarsening process for clustering was also pursued in [21, 3]. 3 Markov Chain Terminology We ﬁrst provide a brief overview of the Markov chain terminology here (for more details see [17, 15, 6]). We consider an undirected graph G = (V; E) with vertices vi, for i = f1; : : : ; ng, and edges ei;j with non-negative weights ai;j. Here the weight ai;j represents the afﬁnity between vertices vi and vj. The afﬁnities are represented by a non-negative, symmetric n (cid:2) n matrix A having weights ai;j as elements. The degree of a node j is j=1 aj;i, where we deﬁne D = diag(d1; : : : ; dn). A Markov chain is deﬁned using these afﬁnities by setting a transition probability matrix M = AD(cid:0)1, where the columns of M each sum to 1. The transition probability matrix deﬁnes the random walk of a particle on the graph G. The random walk need never be explicitly carried out; instead, it can be analytically ex- pressed using the leading order eigenvectors, and eigenvalues, of the Markov transition matrix. Because the stochastic matrices need not be symmetric in general, a direct eigen decomposition step is not preferred for reasons of instability. This problem is easily circum- vented by considering a normalized afﬁnity matrix: L = D(cid:0)1=2AD(cid:0)1=2, which is related to the stochastic matrix by a similarity transformation: L = D(cid:0)1=2M D1=2. Because L is symmetric, it can be diagonalized: L = U (cid:3)U T , where U = [~u1; ~u2; (cid:1) (cid:1) (cid:1) ; ~un] is an orthogonal set of eigenvectors and (cid:3) is a diagonal matrix of eigenvalues [(cid:21)1; (cid:21)2; (cid:1) (cid:1) (cid:1) ; (cid:21)n] sorted in decreasing order. The eigenvectors have unit length k~ukk = 1 and from the form of A and D it can be shown that the eigenvalues (cid:21)i 2 ((cid:0)1; 1], with at least one eigenvalue equal to one. Without loss of generality, we take (cid:21)1 = 1. Because L and M are similar we can perform an eigen decomposition of the Markov transition matrix as: M = D1=2LD(cid:0)1=2 = D1=2U (cid:3) U T D(cid:0)1=2. Thus an eigenvector ~u of L corresponds to an eigenvector D1=2~u of M with the same eigenvalue (cid:21). The Markovian relaxation process after (cid:12) iterations, namely M (cid:12), can be represented as: M (cid:12) = D1=2U (cid:3)(cid:12)U T D(cid:0)1=2. Therefore, a particle undertaking a random walk with an initial distribution ~p 0 acquires after (cid:12) steps a distribution ~p (cid:12) given by: ~p (cid:12) = M (cid:12)~p 0. Assuming the graph is connected, as (cid:12) ! 1, the Markov chain approaches a unique i=1 di, and thus, M 1 = ~(cid:25)1T , where 1 is a n-dim column vector of all ones. Observe that ~(cid:25) is an eigenvector of M as it is easy to show that M~(cid:25) = ~(cid:25) and the corresponding eigenvalue is 1. Next, we show how to generate hierarchical, successively low-ranked approximations for the transition matrix M.

stationary distribution given by ~(cid:25) = diag(D)=Pn

deﬁned to be: dj = Pn

Do not remove: This comment is monitored to verify that the site is working properly