{"title": "Semi-supervised Eigenvectors for Locally-biased Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2528, "page_last": 2536, "abstract": "In many applications, one has information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks nearby that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning.", "full_text": "Semi-supervised Eigenvectors\nfor Locally-biased Learning\n\nToke Jansen Hansen\n\nSection for Cognitive Systems\n\nDTU Informatics\n\nTechnical University of Denmark\n\ntjha@imm.dtu.dk\n\nMichael W. Mahoney\n\nDepartment of Mathematics\n\nStanford University\nStanford, CA 94305\n\nmmahoney@cs.stanford.edu\n\nAbstract\n\nIn many applications, one has side information, e.g., labels that are provided\nin a semi-supervised manner, about a speci\ufb01c target region of a large data set,\nand one wants to perform machine learning and data analysis tasks \u201cnearby\u201d\nthat pre-speci\ufb01ed target region. 
Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning.

1 Introduction

We consider the problem of finding a set of locally-biased vectors that inherit many of the "nice" properties that the leading nontrivial global eigenvectors of a graph Laplacian have (for example, that they capture "slowly varying" modes in the data, that they are fairly efficiently computable, and that they can be used for common machine learning and data analysis tasks such as kernel-based and semi-supervised learning), so that we can perform what we will call locally-biased machine learning in a principled manner.

By locally-biased machine learning, we mean that we have a very large data set, e.g., represented as a graph, and that we have information, e.g., given in a semi-supervised manner, that certain "regions" of the data graph are of particular interest. In this case, we may want to focus predominantly on those regions and perform data analysis and machine learning, e.g., classification, clustering, ranking, etc., that is "biased toward" those pre-specified regions.
Examples of this include the following.

- Locally-biased community identification. In social and information network analysis, one might have a small "seed set" of nodes that belong to a cluster or community of interest [2, 13]; in this case, one might want to perform link or edge prediction, or one might want to "refine" the seed set in order to find other nearby members.

- Locally-biased image segmentation. In computer vision, one might have a large corpus of images along with a "ground truth" set of pixels as provided by a face detection algorithm [7, 14, 15]; in this case, one might want to segment entire heads from the background for all the images in the corpus in an automated manner.

- Locally-biased neural connectivity analysis. In functional magnetic resonance imaging applications, one might have small sets of neurons that "fire" in response to some external experimental stimulus [16]; in this case, one might want to analyze the subsequent temporal dynamics of stimulation of neurons that are "nearby," either in terms of connectivity topology or functional response.

These examples present considerable challenges for spectral techniques and traditional eigenvector-based methods. At root, the reason is that eigenvectors are inherently global quantities, thus limiting their applicability in situations where one is interested in very local properties of the data.

In this paper, we provide a methodology to construct what we will call semi-supervised eigenvectors of a graph Laplacian; and we illustrate how these locally-biased eigenvectors inherit many of the properties that make the leading nontrivial global eigenvectors of the graph Laplacian so useful in applications.
To achieve this, we will formulate an optimization ansatz that is a variant of the usual global spectral graph partitioning optimization problem, one that includes a natural locality constraint as well as an orthogonality constraint, and we will iteratively solve this problem.

In more detail, assume that we are given as input a (possibly weighted) data graph G = (V, E), an indicator vector s of a small "seed set" of nodes, a correlation parameter κ ∈ [0, 1], and a positive integer k. Then, informally, we would like to construct k vectors that satisfy the following bicriteria: first, each of these k vectors is well-correlated with the input seed set; and second, those k vectors describe successively-orthogonalized directions of maximum variance, in a manner analogous to the leading k nontrivial global eigenvectors of the graph Laplacian. (We emphasize that the seed set s of nodes, the integer k, and the correlation parameter κ are part of the input; and thus they should be thought of as being available in a semi-supervised manner.) Somewhat more formally, our main algorithm, Algorithm 1 in Section 3, returns as output k semi-supervised eigenvectors; each of these is the solution to an optimization problem of the form of GENERALIZED LOCALSPECTRAL in Figure 1, and thus each "captures" (say) κ/k of the correlation with the seed set. Our main theoretical result states that these vectors define successively-orthogonalized directions of maximum variance, conditioned on being κ/k-well-correlated with an input seed set s; and that each of these k semi-supervised eigenvectors can be computed quickly as the solution to a system of linear equations.

From a technical perspective, the work most closely related to ours is that of Mahoney et al. [14]. The original algorithm of Mahoney et al.
[14] introduced a methodology to construct a locally-biased version of the leading nontrivial eigenvector of a graph Laplacian and showed (theoretically and empirically, in a social network analysis application) that the resulting vector could be used to partition a graph in a locally-biased manner. From this perspective, our extension incorporates a natural orthogonality constraint, namely that successive vectors be orthogonal to previous vectors. Subsequent to the work of [14], [15] applied the algorithm of [14] to the problem of finding locally-biased cuts in a computer vision application. Similar ideas have also been applied somewhat differently. For example, [2] use locally-biased random walks, e.g., short random walks starting from a small seed set of nodes, to find clusters and communities in graphs arising in Internet advertising applications; [13] used locally-biased random walks to characterize the local and global clustering structure of a wide range of social and information networks; and [11] developed the Spectral Graph Transducer (SGT), which performs transductive learning via spectral graph partitioning. The objectives in both [11] and [14] are constrained eigenvalue problems that can be solved by finding the smallest eigenvalue of an asymmetric generalized eigenvalue problem, but in practice this procedure can be highly unstable [8]. The SGT reduces the instabilities by performing all calculations in a subspace spanned by the d smallest eigenvectors of the graph Laplacian, whereas [14] perform a binary search, exploiting the monotonic relationship between a control parameter and the corresponding Lagrange multiplier.

In parallel, [3] and a large body of subsequent work, including [6], used eigenvectors of the graph Laplacian to perform dimensionality reduction and data representation, in unsupervised and semi-supervised settings.
Many of these methods have a natural interpretation in terms of kernel-based learning [18]. Many of these diffusion-based spectral methods also have a natural interpretation in terms of spectral ranking [21]. "Topic sensitive" and "personalized" versions of these spectral ranking methods have also been studied [9, 10]; and these were the motivation for diffusion-based methods to find locally-biased clusters in large graphs [19, 1, 14]. Our optimization ansatz is a generalization of the linear equation formulation of the PageRank procedure [17, 14, 21], and the solution involves Laplacian-based linear equation solving, which has been suggested as a primitive of more general interest in large-scale data analysis [20]. Finally, the form of our optimization problem has similarities to other work in computer vision applications: e.g., [23] and [7] find good conductance clusters subject to a set of linear constraints.

2 Background and Notation

Let G = (V, E, w) be a connected undirected graph with n = |V| vertices and m = |E| edges, in which edge {i, j} has non-negative weight w_ij. In the following, A_G ∈ R^{V×V} will denote the adjacency matrix of G, while D_G ∈ R^{V×V} will denote the diagonal degree matrix of G, i.e., D_G(i, i) = d_i = Σ_{{i,j}∈E} w_ij, the weighted degree of vertex i. Moreover, for a set of vertices S ⊆ V in a graph, the volume of S is vol(S) := Σ_{i∈S} d_i. The Laplacian of G is defined as L_G := D_G − A_G. (This is also called the combinatorial Laplacian, in which case the normalized Laplacian of G is D_G^{-1/2} L_G D_G^{-1/2}.)

The Laplacian is the symmetric matrix having quadratic form x^T L_G x = Σ_{{i,j}∈E} w_ij (x_i − x_j)^2, for x ∈ R^V.
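As a quick sanity check on this notation, the adjacency, degree, and Laplacian matrices and the quadratic form can be assembled in a few lines; the 4-node weighted graph below is a hypothetical example, not data from the paper.

```python
import numpy as np

# Hypothetical 4-node weighted graph, given as an edge list {i, j}: w_ij.
edges = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 3.0, (3, 0): 1.0}
n = 4
A = np.zeros((n, n))                       # adjacency matrix A_G
for (i, j), w in edges.items():
    A[i, j] = A[j, i] = w
D = np.diag(A.sum(axis=1))                 # diagonal degree matrix D_G
L = D - A                                  # combinatorial Laplacian L_G = D_G - A_G

x = np.array([0.5, -1.0, 2.0, 0.0])
quad = sum(w * (x[i] - x[j]) ** 2 for (i, j), w in edges.items())
assert np.isclose(x @ L @ x, quad)         # x^T L_G x = sum_{ij} w_ij (x_i - x_j)^2
assert np.allclose(L @ np.ones(n), 0.0)    # the all-ones vector has eigenvalue 0
```

The two assertions verify exactly the facts used next: the quadratic form identity, and that 1 is in the null space of L_G.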
This implies that L_G is positive semidefinite and that the all-ones vector 1 ∈ R^V is the eigenvector corresponding to the smallest eigenvalue 0. The generalized eigenvalues of L_G x = λ_i D_G x are 0 = λ_1 < λ_2 ≤ ··· ≤ λ_n. We will use v_2 to denote the smallest non-trivial eigenvector, i.e., the eigenvector corresponding to λ_2; v_3 to denote the next eigenvector; and so on. Finally, for a matrix A, let A^+ denote its (uniquely defined) Moore-Penrose pseudoinverse. For two vectors x, y ∈ R^n, and the degree matrix D_G for a graph G, we define the degree-weighted inner product as x^T D_G y := Σ_{i=1}^n x_i y_i d_i. In particular, if a vector x has unit degree-weighted norm, then x^T D_G x = 1. Given a subset of vertices S ⊆ V, we denote by 1_S the indicator vector of S in R^V and by 1 the vector in R^V having all entries set equal to 1.

3 Optimization Approach to Semi-supervised Eigenvectors

3.1 Motivation for the Program

Recall the optimization perspective on how one computes the leading nontrivial global eigenvectors of the normalized Laplacian. The first nontrivial eigenvector v_2 is the solution to the problem GLOBALSPECTRAL that is presented on the left of Figure 1. Equivalently, although GLOBALSPECTRAL is a non-convex optimization problem, strong duality holds for it, and its solution may be computed as v_2, the leading nontrivial generalized eigenvector of L_G. The next eigenvector v_3 is the solution to GLOBALSPECTRAL, augmented with the constraint that x^T D_G v_2 = 0; and in general the tth generalized eigenvector of L_G is the solution to GLOBALSPECTRAL, augmented with the constraints that x^T D_G v_i = 0, for i ∈ {2, . . . , t − 1}.
Clearly, this set of constraints and the constraint x^T D_G 1 = 0 can be written as x^T D_G Q = 0, where 0 is a (t − 1)-dimensional all-zeros vector, and where Q is an n × (t − 1) orthogonal matrix whose ith column equals v_i (where v_1 = 1, the all-ones vector, is the first column of Q).

Also presented in Figure 1 is LOCALSPECTRAL, which includes a constraint requiring the solution to be well-correlated with an input seed set. This LOCALSPECTRAL optimization problem was introduced in [14], where it was shown that the solution to LOCALSPECTRAL may be interpreted as a locally-biased version of the second eigenvector of the Laplacian. In particular, although LOCALSPECTRAL is not convex, its solution can be computed efficiently as the solution to a set of linear equations that generalize the popular Personalized PageRank procedure; in addition, by performing a sweep cut and appealing to a variant of Cheeger's inequality, this locally-biased eigenvector can be used to perform locally-biased spectral graph partitioning [14].

3.2 Our Main Algorithm

We will formulate the problem of computing semi-supervised vectors in terms of a primitive optimization problem of independent interest. Consider the GENERALIZED LOCALSPECTRAL optimization problem, as shown in Figure 1. For this problem, we are given a graph G = (V, E), with associated Laplacian matrix L_G and diagonal degree matrix D_G; an indicator vector s of a small "seed set" of nodes; a correlation parameter κ ∈ [0, 1]; and an n × ν constraint matrix Q that may be assumed to be an orthogonal matrix.

GLOBALSPECTRAL:
    minimize  x^T L_G x
    s.t.      x^T D_G x = 1
              x^T D_G 1 = 0

LOCALSPECTRAL:
    minimize  x^T L_G x
    s.t.      x^T D_G x = 1
              x^T D_G 1 = 0
              x^T D_G s ≥ √κ

GENERALIZED LOCALSPECTRAL:
    minimize  x^T L_G x
    s.t.      x^T D_G x = 1
              x^T D_G Q = 0
              x^T D_G s ≥ √κ

Figure 1: Left: The usual GLOBALSPECTRAL partitioning optimization problem; the vector achieving the optimal solution is v_2, the leading nontrivial generalized eigenvector of L_G with respect to D_G. Middle: The LOCALSPECTRAL optimization problem, which was originally introduced in [14]; for κ = 0, this coincides with the usual global spectral objective, while for κ > 0, this produces solutions that are biased toward the seed vector s. Right: The GENERALIZED LOCALSPECTRAL optimization problem we introduce, which includes both the locality constraint and a more general orthogonality constraint. Our main algorithm for computing semi-supervised eigenvectors will iteratively compute the solution to GENERALIZED LOCALSPECTRAL for a sequence of Q matrices. In all three cases, the optimization variable is x ∈ R^n.

We will assume (without loss of generality) that s is properly normalized and orthogonalized so that s^T D_G s = 1 and s^T D_G 1 = 0. While s can be a general unit vector orthogonal to 1, it may be helpful to think of s as the indicator vector of one or more vertices in V, corresponding to the target region of the graph.

In words, the problem GENERALIZED LOCALSPECTRAL asks us to find a vector x ∈ R^n that minimizes the variance x^T L_G x subject to several constraints: that x is unit length; that x is orthogonal to the span of Q; and that x is √κ-well-correlated with the input seed set vector s. In our application of GENERALIZED LOCALSPECTRAL to the computation of semi-supervised eigenvectors, we will iteratively compute the solution to GENERALIZED LOCALSPECTRAL, updating Q to contain the already-computed semi-supervised eigenvectors.
That is, to compute the first semi-supervised eigenvector, we let Q = 1, i.e., the n-dimensional all-ones vector, which is the trivial eigenvector of L_G, in which case Q is an n × 1 matrix; and to compute each subsequent semi-supervised eigenvector, we let the columns of Q consist of 1 and the other semi-supervised eigenvectors found in each of the previous iterations.

To show that GENERALIZED LOCALSPECTRAL is efficiently solvable, note that it is a quadratic program with only one quadratic constraint and one linear equality constraint. In order to remove the equality constraint, which will simplify the problem, let's change variables by defining the n × (n − ν) matrix F via {x : Q^T D_G x = 0} = {x : x = F y}. That is, the columns of F span the null space of Q^T D_G; and we will take F to be an orthogonal matrix. Then, with respect to the y variable, GENERALIZED LOCALSPECTRAL becomes

    minimize_y  y^T F^T L_G F y
    subject to  y^T F^T D_G F y = 1,
                y^T F^T D_G s ≥ √κ.     (1)

In terms of the variable x, the solution to this optimization problem is of the form

    x* = c F (F^T (L_G − γ D_G) F)^+ F^T D_G s
       = c (F F^T (L_G − γ D_G) F F^T)^+ D_G s,     (2)

for a normalization constant c ∈ (0, ∞) and for some γ that depends on √κ. The second line follows from the first since F is an n × (n − ν) orthogonal matrix. This so-called "S-procedure" is described in greater detail in Chapter 5 and Appendix B of [4]. The significance of this is that, although it is a non-convex optimization problem, the GENERALIZED LOCALSPECTRAL problem can be solved by solving a linear equation, in the form given in Eqn.
(2).

Returning to our problem of computing semi-supervised eigenvectors, recall that, in addition to the input for the GENERALIZED LOCALSPECTRAL problem, we need to specify a positive integer k that indicates the number of vectors to be computed. In the simplest case, we would like the correlation to be "evenly distributed" across all k vectors, in which case we will require that each vector is √(κ/k)-well-correlated with the input seed set vector s; but this assumption can easily be relaxed, and thus Algorithm 1 is formulated more generally as taking a k-dimensional vector κ = [κ_1, . . . , κ_k]^T of correlation coefficients as input.

To compute the first semi-supervised eigenvector, we will let Q = 1, the all-ones vector, in which case the first nontrivial semi-supervised eigenvector is

    x*_1 = c (L_G − γ_1 D_G)^+ D_G s,     (3)

where γ_1 is chosen to saturate the part of the correlation constraint along the first direction. (Note that the projections F F^T from Eqn. (2) are not present in Eqn. (3) since by design s^T D_G 1 = 0.) That is, to find the correct setting of γ_1, it suffices to perform a binary search over the possible values of γ_1 in the interval (−vol(G), λ_2(G)) until the correlation constraint is satisfied, that is, until (s^T D_G x)^2 is sufficiently close to κ_1; see [8, 14].

To compute subsequent semi-supervised eigenvectors, i.e., at steps t = 2, . . . , k if one ultimately wants a total of k semi-supervised eigenvectors, one lets Q be the n × t matrix with first column equal to 1 and with jth column, for j = 2, . . . , t, equal to x*_{j−1} (where we emphasize that x*_{j−1} is a vector, not an element of a vector). That is, Q is of the form Q = [1, x*_1, . . . , x*_{t−1}], where the x*_i are successive semi-supervised eigenvectors, and the projection matrix F F^T is of the form F F^T = I − D_G Q (Q^T D_G D_G Q)^{−1} Q^T D_G, due to the degree-weighted inner norm. Then, by Eqn. (2), the tth semi-supervised eigenvector takes the form

    x*_t = c (F F^T (L_G − γ_t D_G) F F^T)^+ D_G s.     (4)

Algorithm 1 Semi-supervised eigenvectors
Input: L_G, D_G, s, κ = [κ_1, . . . , κ_k]^T, ε
Require: s^T D_G 1 = 0, s^T D_G s = 1, κ^T 1 ≤ 1
 1: Q = [1]
 2: for t = 1 to k do
 3:   F F^T ← I − D_G Q (Q^T D_G D_G Q)^{−1} Q^T D_G
 4:   ⊤ ← λ_2 where F F^T L_G F F^T v_2 = λ_2 F F^T D_G F F^T v_2
 5:   ⊥ ← −vol(G)
 6:   repeat
 7:     γ_t ← (⊥ + ⊤)/2   (binary search over γ_t)
 8:     x_t ← (F F^T (L_G − γ_t D_G) F F^T)^+ F F^T D_G s
 9:     Normalize x_t such that x_t^T D_G x_t = 1
10:     if (x_t^T D_G s)^2 > κ_t then ⊥ ← γ_t else ⊤ ← γ_t end if
11:   until ‖(x_t^T D_G s)^2 − κ_t‖ ≤ ε or ‖(⊥ + ⊤)/2 − γ_t‖ ≤ ε
12:   Augment Q with x*_t by letting Q = [Q, x*_t].
13: end for

In more detail, Algorithm 1 presents pseudo-code for our main algorithm for computing semi-supervised eigenvectors. Several things should be noted about our implementation. First, note that we implicitly compute the projection matrix F F^T. Second, a naive approach to Eqn. (2) does not immediately lead to an efficient solution, since D_G s will not be in the span of (F F^T (L_G − γ D_G) F F^T), thus leading to a large residual.
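The γ-search itself is straightforward; the following is a minimal numpy sketch of the t = 1 case of Algorithm 1, where Q = [1] and, by Eqn. (3), the projections F F^T drop out. The ring graph, seed node, and parameter values are illustrative only, and a dense pseudoinverse stands in for the projected conjugate gradient solver used in the actual implementation.

```python
import numpy as np

# Illustrative ring graph (each node joined to its two neighbours).
n = 20
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
d = A.sum(axis=1)
D, L = np.diag(d), np.diag(d) - A

s = np.zeros(n); s[0] = 1.0                        # seed: indicator of node 0
ones = np.ones(n)
s -= ones * (ones @ D @ s) / (ones @ D @ ones)     # enforce s^T D_G 1 = 0
s /= np.sqrt(s @ D @ s)                            # enforce s^T D_G s = 1

Dih = np.diag(1.0 / np.sqrt(d))
lam2 = np.sort(np.linalg.eigvalsh(Dih @ L @ Dih))[1]
kappa, eps = 0.5, 1e-10
lo, hi = -d.sum(), lam2                            # gamma_1 lies in (-vol(G), lambda_2)
for _ in range(200):                               # binary search over gamma_1
    gamma = 0.5 * (lo + hi)
    x = np.linalg.pinv(L - gamma * D) @ (D @ s)    # x ∝ (L_G - gamma D_G)^+ D_G s
    x /= np.sqrt(x @ D @ x)                        # normalize so that x^T D_G x = 1
    corr = (x @ D @ s) ** 2
    if abs(corr - kappa) <= eps:
        break
    lo, hi = (gamma, hi) if corr > kappa else (lo, gamma)
```

At termination x is unit length in the degree-weighted norm and its squared correlation with the seed saturates κ, mirroring lines 6-11 of Algorithm 1.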
By changing variables so that x = F F^T y, the solution becomes x* ∝ F F^T (F F^T (L_G − γ D_G) F F^T)^+ F F^T D_G s. Since F F^T is a projection matrix, this expression is equivalent to x* ∝ (F F^T (L_G − γ D_G) F F^T)^+ F F^T D_G s. Third, we exploit that F F^T (L_G − γ_i D_G) F F^T is an SPSD matrix, and we apply the conjugate gradient method, rather than computing the explicit pseudoinverse. That is, in the implementation we never represent the dense matrix F F^T; instead, we treat it as an operator and simply evaluate the result of applying it to a vector on either side. Fourth, we use that λ_2 can never decrease (here we refer to λ_2 as the smallest non-zero eigenvalue of the modified matrix), so we only recalculate the upper bound for the binary search when an iteration saturates without satisfying ‖(x_t^T D_G s)^2 − κ_t‖ ≤ ε. In case of saturation, one can for instance recalculate λ_2 iteratively by using the inverse iteration method, v_2^{k+1} ∝ (F F^T L_G F F^T − λ_2^{est} F F^T D_G F F^T)^+ F F^T D_G F F^T v_2^k, normalizing such that (v_2^{k+1})^T v_2^{k+1} = 1.

4 Illustrative Empirical Results

In this section, we will provide a detailed empirical evaluation of our method of semi-supervised eigenvectors and how they can be used for locally-biased machine learning. Our goal will be two-fold: first, to illustrate how the "knobs" of our method work; and second, to illustrate the usefulness of the method in a real application. To do so, we will consider:

- Toy data. In Section 4.1, we will consider one-dimensional examples of the popular "small world" model [22].
This is a parameterized family of models that interpolates between low-dimensional grids and random graphs; and, as such, it will allow us to illustrate the behavior of our method and its various parameters in a controlled setting.

- Handwritten image data. In Section 4.2, we will consider the data from the MNIST digit data set [12]. These data have been widely studied in machine learning and related areas and they have substantial "local heterogeneity"; and thus these data will allow us to illustrate how our method may be used to perform locally-biased versions of common machine learning tasks such as smoothing, clustering, and kernel construction.

4.1 Small-world Data

To illustrate how the "knobs" of our method work, and in particular how κ and γ interplay, we consider data constructed from the so-called small-world model. To demonstrate how semi-supervised eigenvectors can focus on specific target regions of a data graph to capture the slowest modes of local variation, we plot semi-supervised eigenvectors around illustrations of (non-rewired and rewired) realizations of the small-world graph; see Figure 2.

(a) Global eigenvectors: p = 0; λ_2 = 0.000011, λ_3 = 0.000011, λ_4 = 0.000046, λ_5 = 0.000046.
(b) Global eigenvectors: p = 0.01; λ_2 = 0.000149, λ_3 = 0.000274, λ_4 = 0.000315, λ_5 = 0.000489.
(c) Semi-supervised eigenvectors: p = 0.01, κ = 0.005; γ_1 = 0.000047, γ_2 = 0.000052, γ_3 = −0.000000, γ_4 = −0.000000.
(d) Semi-supervised eigenvectors: p = 0.01, κ = 0.05; γ_1 = −0.004367, γ_2 = −0.001778, γ_3 = −0.001665, γ_4 = −0.000822.

Figure 2: In each case (a-d), the data consist of 3600 nodes, each connected to its 8 nearest-neighbors.
In the center of each subfigure, we show the nodes (blue) and edges (black and light gray are the local edges, and blue are the randomly-rewired edges). In each subfigure, we wrap a plot (black x-axis and gray background) visualizing the 4 smallest semi-supervised eigenvectors, allowing us to see the effect of random edges (different values of rewiring probability p) and degree of localization (different values of κ). Eigenvectors are color coded as blue, red, yellow, and green, starting with the one having the smallest eigenvalue. See the main text for more details.

In Figure 2.a, we show a graph with no randomly-rewired edges (p = 0) and a locality parameter κ such that the global eigenvectors are obtained. This yields a symmetric graph with eigenvectors corresponding to orthogonal sinusoids; i.e., for all eigenvectors except the all-ones vector with eigenvalue 0, the algebraic multiplicity is 2, so the first two capture the slowest mode of variation and correspond to a sine and cosine with equal random phase-shift (rotational ambiguity). In Figure 2.b, random edges have been added with probability p = 0.01, and the locality parameter κ is still chosen such that the global eigenvectors of the rewired graph are obtained. In particular, note the small kinks in the eigenvectors at the location of the randomly added edges. Since the graph is no longer symmetric, all of the visualized eigenvectors have algebraic multiplicity 1. Moreover, note the slow mode of variation in the interval on the top left; a normalized cut based on the leading global eigenvector would extract this region, since the remainder of the ring is more well-connected due to the degree of rewiring. In Figure 2.c, we see the same graph realization as in Figure 2.b, except that the semi-supervised eigenvectors have a seed node at the top of the circle and the correlation
Note that, like the global eigenvectors, the local approach produces modes\nof increasing variation. In addition, note that the neighborhood around \u201c11 o-clock\u201d contains more\nmass, when compared with Figure 2.b; the reason for this is that this region is well-connected with\nthe seed via a randomly added edge. Above the visualization we also show the \u03b3t that saturates \u03bat,\ni.e., \u03b3t is the Lagrange multiplier that de\ufb01nes the effective correlation \u03bat. Not shown is that if we\nkept reducing \u03ba, then \u03b3t would tend towards \u03bbt+1, and the respective semi-supervised eigenvector\nwould tend towards the global eigenvector. Finally, in Figure 2.d, the desired correlation is increased\nto \u03ba = 0.05 (thus decreasing the value of \u03b3t), making the different modes of variation more local-\nized in the neighborhood of the seed. It should be clear that, in addition to being determined by the\nlocality parameter, we can think of \u03b3 as a regularizer biasing the global eigenvectors towards the\nregion near the seed set.\n\n4.2 MNIST Digit Data\n\n\u03c32\ni\n\n(cid:107)xi \u2212 xj(cid:107)2), where \u03c32\n\nWe now demonstrate the semi-supervised eigenvectors as a feature extraction preprocessing step in\na machine learning setting. We consider the well-studied MNIST dataset containing 60000 training\ndigits and 10000 test digits ranging from 0 to 9. We construct the complete 70000 \u00d7 70000 k-NN\ngraph with k = 10 and with edge weights given by wij = exp(\u2212 4\ni being\nthe Euclidean distance to it\u2019s nearest neighbor, and we de\ufb01ne the graph Laplacian in the usual way.\nWe evaluate the semi-supervised eigenvectors in a transductive learning setting by disregarding the\nmajority of labels in the entire training data. 
We then use a few samples from each class to seed our semi-supervised eigenvectors, and a few others to train a downstream classification algorithm. Here we choose to apply the SGT of [11] for two main reasons. First, the transductive classifier is inherently designed to work on a subset of the global eigenvectors of the graph Laplacian, making it ideal for validating that the localized basis constructed by our semi-supervised eigenvectors can be more informative when we are solely interested in the "local heterogeneity" near a seed set. Second, using the SGT based on global eigenvectors is a good point of comparison, because we are only interested in the effect of our subspace representation. (If we used one type of classifier in the local setting, and another in the global, the classification accuracy that we measure would obviously be biased.) As in [11], we normalize the spectrum of both global and semi-supervised eigenvectors by replacing the eigenvalues with some monotonically increasing function. We use λ_i = i²/k², i.e., focusing on the ranking among the smallest cuts; see [5]. Furthermore, we fix the regularization parameter of the SGT to c = 3200, and for simplicity we fix γ = 0 for all semi-supervised eigenvectors, implicitly defining the effective κ = [κ_1, . . . , κ_k]^T.
Clearly, other correlation distributions and values of γ may yield subspaces with even better discriminative properties.¹

                 #Semi-supervised eigenvectors for SGT     #Global eigenvectors for SGT
Labeled points    1     2     4     6     8     10          1     5     10    15    20    25
1:1              0.39  0.39  0.38  0.38  0.38  0.36        0.50  0.48  0.36  0.27  0.27  0.19
1:10             0.30  0.31  0.25  0.23  0.19  0.15        0.49  0.36  0.09  0.08  0.06  0.06
5:50             0.12  0.15  0.09  0.08  0.07  0.06        0.49  0.09  0.08  0.07  0.05  0.04
10:100           0.09  0.10  0.07  0.06  0.05  0.05        0.49  0.08  0.07  0.06  0.04  0.04
50:500           0.03  0.03  0.03  0.03  0.03  0.03        0.49  0.10  0.07  0.06  0.04  0.04

Table 1: Classification error for the SGT based on, respectively, semi-supervised and global eigenvectors. The first column from the left encodes the configuration; e.g., 1:10 means 1 seed and 10 training samples from each class (for a total of 22 samples; for the global approach these are all used for training). When the seed is well determined and the number of training samples moderate (50:500), a single semi-supervised eigenvector is sufficient, whereas for less data we benefit from using multiple semi-supervised eigenvectors. All experiments have been repeated 10 times.

Here, we consider the task of discriminating between fours and nines, as these two classes tend to overlap more than other combinations. (A closed four usually resembles a nine more than an "open" four does.) Hence, we expect localization on low-order global eigenvectors, meaning that class separation will not be evident in the leading global eigenvector, but instead will be "buried" further down the spectrum. Thus, this will illustrate how semi-supervised eigenvectors can represent relevant heterogeneities in a local subspace of low dimensionality.
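The spectrum renormalization used above, λ_i = i²/k², can be sketched in a few lines; the two-clique toy graph, the normalized-Laplacian variant, and the choice k = 4 are illustrative, and the sketch shows only the eigenvalue replacement, not the full SGT.

```python
import numpy as np

# Toy graph: two 4-cliques joined by a single edge.
n = 8
A = np.zeros((n, n))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 1.0
d = A.sum(axis=1)
Dih = np.diag(1.0 / np.sqrt(d))
Lnorm = np.eye(n) - Dih @ A @ Dih            # a normalized Laplacian

k = 4
lam, V = np.linalg.eigh(Lnorm)               # eigenvalues sorted ascending
idx = np.argsort(lam)[1:k + 1]               # k smallest nontrivial eigenpairs
new_lam = np.arange(1, k + 1) ** 2 / k ** 2  # replace lambda_i by i^2 / k^2
M = (V[:, idx] * new_lam) @ V[:, idx].T      # same eigenvectors, renormalized spectrum
```

Since only a monotone function of the eigenvalue index survives, M retains the ranking of the small cuts while discarding their absolute scale, which is the point of the normalization.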
Table 1 summarizes our classification results based respectively on semi-supervised and global eigenvectors. Finally, Figures 3 and 4 illustrate two realizations for the 1:10 configuration, where the training samples are fixed but the seed nodes vary, to demonstrate the influence of the seed; see the captions of these figures for further details.

¹A thorough analysis regarding the importance of this parameter will appear in the journal version.

[Figure 3 shows digit images for the seed sets s+ and s−, the training sets l+ and l−, and the classified test data.]

Figure 3: Left: Shows a subset of the classification results for the SGT based on 5 semi-supervised eigenvectors seeded in s+ and s−, and trained using the samples l+ and l−. Misclassifications are marked with black frames. Right: Visualizes all test data spanned by the first 5 semi-supervised eigenvectors, by plotting each component as a function of the others. Red (blue) points correspond to 4 (9), whereas green points correspond to the remaining digits. As the seed nodes are good representatives, we note that the eigenvectors provide a good class separation. We also plot the error as a function of the local dimensionality, as well as the unexplained correlation, i.e., the initial components explain the majority of the correlation with the seed (an effect of γ = 0). This particular realization, based on the leading 5 semi-supervised eigenvectors, yields an error of ≈ 0.03 (dashed circle).

[Figure 4 shows digit images for the seed sets s+ and s−, the training sets l+ and l−, and the classified test data.]

Figure 4: See the general description in Figure 3.
Here we illustrate an instance where s+ shares many similarities with s−, i.e., s+ lies on the boundary of the two classes. This particular realization achieves a classification error of ≈ 0.30 (dashed circle). In this constellation we first discover localization on low-order semi-supervised eigenvectors (≈ 12 eigenvectors), which is comparable to the error based on global eigenvectors (see Table 1); i.e., further down the spectrum we recover from the bad seed and pick up the relevant mode of variation.

In summary: We introduced the concept of semi-supervised eigenvectors that are biased towards local regions of interest in a large data graph. We demonstrated their feasibility on a well-studied dataset and found that our approach leads to more compact subspace representations by extracting desired local heterogeneities. Moreover, the algorithm is scalable, as the eigenvectors are computed by solving a sparse system of linear equations, preserving the low O(m) space complexity. Finally, we foresee that the approach will prove useful in a wide range of data analysis fields, due to the algorithm's speed, simplicity, and stability.

References

[1] R. Andersen, F.R.K. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS '06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475–486, 2006.

[2] R. Andersen and K. Lang. Communities from seed sets.
In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 223–232, 2006.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

[5] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Becker, editor, NIPS 2002, volume 15, pages 585–592, Cambridge, MA, USA, 2003.

[6] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition in data: Diffusion maps. Proc. Natl. Acad. Sci. USA, 102(21):7426–7431, 2005.

[7] A. P. Eriksson, C. Olsson, and F. Kahl. Normalized cuts revisited: A reformulation for segmentation with linear grouping constraints. In Proceedings of the 11th International Conference on Computer Vision, pages 1–8, 2007.

[8] W. Gander, G. H. Golub, and U. von Matt. A constrained eigenvalue problem. Linear Algebra and its Applications, 114/115:815–839, 1989.

[9] T.H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, 2003.

[10] G. Jeh and J. Widom. Scaling personalized web search. In WWW '03: Proceedings of the 12th International Conference on World Wide Web, pages 271–279, 2003.

[11] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.

[12] Y. LeCun and C. Cortes. The MNIST database of handwritten digits.

[13] J. Leskovec, K.J. Lang, A. Dasgupta, and M.W. Mahoney.
Statistical properties of community structure in large social and information networks. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 695–704, 2008.

[14] M. W. Mahoney, L. Orecchia, and N. K. Vishnoi. A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. Technical report, 2009. Preprint: arXiv:0912.0681.

[15] S. Maji, N. K. Vishnoi, and J. Malik. Biased normalized cuts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2057–2064, 2011.

[16] K.A. Norman, S.M. Polyn, G.J. Detre, and J.V. Haxby. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9):424–430, 2006.

[17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[18] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

[19] D.A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 81–90, 2004.

[20] S.-H. Teng. The Laplacian paradigm: Emerging algorithms for massive graphs. In Proceedings of the 7th Annual Conference on Theory and Applications of Models of Computation, pages 2–14, 2010.

[21] S. Vigna. Spectral ranking. Technical report, 2009. Preprint: arXiv:0912.0238.

[22] D.J. Watts and S.H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.

[23] S. X. Yu and J. Shi. Grouping with bias.
In Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference, pages 1327–1334, 2002.