{"title": "Learning with Partially Absorbing Random Walks", "book": "Advances in Neural Information Processing Systems", "page_first": 3077, "page_last": 3085, "abstract": "We propose a novel stochastic process that with probability $\alpha_i$ is absorbed at the current state $i$, and with probability $1-\alpha_i$ follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set $\mathcal{S}$ of low conductance will be mostly absorbed in $\mathcal{S}$. Moreover, the absorption probabilities vary slowly inside $\mathcal{S}$, while dropping sharply outside $\mathcal{S}$, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible to transfer findings from one paradigm to another. Simulation results demonstrate its promising applications in graph-based learning.", "full_text": "Learning with Partially Absorbing Random Walks\n\nXiao-Ming Wu1, Zhenguo Li1, Anthony Man-Cho So3, John Wright1 and Shih-Fu Chang1,2\n\n1Department of Electrical Engineering, Columbia University\n\n2Department of Computer Science, Columbia University\n\n3Department of SEEM, The Chinese University of Hong Kong\n\n{xmwu, zgli, johnwright, sfchang}@ee.columbia.edu, manchoso@se.cuhk.edu.hk\n\nAbstract\n\nWe propose a novel stochastic process that with probability \u03b1i is absorbed at the current state i, and with probability 1 \u2212 \u03b1i follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. 
Moreover, the absorption\nprobabilities vary slowly inside S, while dropping sharply outside, thus imple-\nmenting the desirable cluster assumption for graph-based learning. Remarkably,\nthe partially absorbing process uni\ufb01es many popular models arising in a variety\nof contexts, provides new insights into them, and makes it possible for transfer-\nring \ufb01ndings from one paradigm to another. Simulation results demonstrate its\npromising applications in retrieval and classi\ufb01cation.\n\n1 Introduction\n\nRandom walks have been widely used for graph-based learning, leading to a variety of models in-\ncluding PageRank [14] for web page ranking, hitting and commute times [8] for similarity measure\nbetween vertices, harmonic functions [20] for semi-supervised learning, diffusion maps [7] for di-\nmensionality reduction, and normalized cuts [12] for clustering. In graph-based learning one often\nadopts the cluster assumption, which states that the semantics usually vary smoothly for vertices\nwithin regions of high density [17], and suggests to place the prediction boundary in regions of\nlow density [5]. It is thus interesting to ask how the cluster assumption can be realized in terms of\nrandom walks.\n\nAlthough a random walk appears to explore the graph globally, it converges to a stationary distribu-\ntion determined solely by vertex degrees regardless of the starting points, a phenomenon well known\nas the mixing of random walks [11]. This causes some random walk approaches intended to capture\nnon-local graph structures to fail, especially when the underlying graph is well connected, i.e., the\nrandom walk has a large mixing rate. For example, it was recently proven in [16] that under some\nmild conditions the hitting and commute times on large graphs do not take into account the global\nstructure of the graph at all, despite the fact that they have integrated all the relevant paths on the\ngraph. 
It is also shown in [13] that the \u201charmonic\u201d walks [20] in high-dimensional spaces converge to a constant distribution as the data size approaches infinity, which is undesirable for classification and regression. These findings show that intuitions regarding random walks can sometimes be misleading, and should be taken with caution. A natural question is: can we design a random walk which implements the cluster assumption with some guarantees?\n\nIn this paper, we propose partially absorbing random walks (PARWs), a novel random walk model whose properties can be analyzed theoretically. In PARWs, a random walk is with probability \u03b1i absorbed at the current state i, and with probability 1 \u2212 \u03b1i follows a random edge out of it.\n\nFigure 1: A partially absorbing random walk. (a) A flow perspective (see text). (b) A second-order Markov chain. (c) An equivalent standard Markov chain with additional sinks.\n\nPARWs are guaranteed to implement the cluster assumption in the sense that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Furthermore, we show that by setting the absorption rates, the absorption probabilities can vary slowly inside S, while dropping sharply outside S. This approximately piecewise constant property makes PARWs highly desirable and robust for a variety of learning tasks including ranking, clustering, and classification, as demonstrated in Section 4. 
More interestingly, it turns out that many existing models including PageRank, hitting and commute times, and label propagation algorithms in semi-supervised learning, can be unified or related in PARWs, which brings at least two benefits. On one hand, our theoretical analysis sheds some light on the understanding of existing models; on the other hand, it enables transferring findings among different paradigms. We present our model in Section 2, analyze a special case of it in Section 3, and show simulation results in Section 4. Section 5 concludes the paper. Most of our proofs are included in supplementary material.\n\n2 Partially Absorbing Random Walks\n\nLet us consider a simple diffusion process illustrated in Fig. 1(a). At the beginning, a unit flow (blue) is injected into the graph at a selected vertex. After one step, some of the flow (red) is \u201cstored\u201d at the vertex while the rest (blue) propagates to its neighbors. Whenever the flow passes a vertex, some fraction of it is retained at that vertex. As this process continues, the amount of flow stored in each vertex will accumulate and there will be less and less flow left running on the graph. After a certain number of steps, there will be almost no flow left running and the flow stored will nearly sum up to 1. The above diffusion process can be made precise in terms of random walks, as shown below.\n\nConsider a discrete-time stochastic process X = {Xt : t \u2265 0} on the state space N = {1, 2, . . . , n}, where the initial state X0 is given, say X0 = i, the next state X1 is determined by the transition probability P(X1 = j|X0 = i) = pij, and the subsequent states are determined by the transition probabilities\n\nP(Xt+2 = j|Xt+1 = i, Xt = k) = 1 if i = j, i = k; 0 if i \u2260 j, i = k; and pij if i \u2260 k,    (1)\n\nwhere t \u2265 0. 
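To make the two-step transition rule in (1) concrete, here is a minimal Monte Carlo sketch (not from the paper) of a PARW on a small toy graph; the graph, the absorption-rate vector `p_abs`, and all numeric values are illustrative assumptions:

```python
import numpy as np

def simulate_parw(W, p_abs, start, rng, max_steps=100_000):
    """One walk under (1): at state i the walk is absorbed with
    probability p_abs[i] = p_ii; otherwise it follows a random edge
    out of i with probability proportional to w_ij."""
    d = W.sum(axis=1)
    i = start
    for _ in range(max_steps):
        if rng.random() < p_abs[i]:
            return i                               # absorbed at state i
        i = rng.choice(len(d), p=W[i] / d[i])      # ordinary random-walk step
    return i                                       # practically unreachable

# Toy graph: two triangles joined by one weak edge (a low-conductance cut).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[u, v] = W[v, u] = 1.0
W[2, 3] = W[3, 2] = 0.1

rng = np.random.default_rng(0)
p_abs = np.full(6, 0.1)                            # uniform partial absorption
absorbed = [simulate_parw(W, p_abs, start=0, rng=rng) for _ in range(2000)]
print(np.mean([a in (0, 1, 2) for a in absorbed]))  # fraction kept in {0, 1, 2}
```

Under these settings a walk started in the left triangle is usually absorbed there, previewing the trapping behavior analyzed later in the paper.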
Note that the process X is time homogeneous, i.e., the transition probabilities in (1)\nare independent of t. In other words, if the previous and current states are the same, the process\nwill remain in the current state forever. Otherwise, the next state is conditionally independent of the\nprevious state given the current state, i.e., the process behaves like a usual random walk.\n\nTo illustrate the above construction, consider Fig. 1(b). Starting from state i, there is some probabil-\nity pii that the process will stay at i in the next step; and once it stays, the process will be absorbed\ninto state i. Hence, we shall call the above process a partially absorbing random walk (PARW),\nwhere pii is the absorption rate of state i. If 0 < pii < 1, then we say that i is a partially absorbing\nstate. If pii = 1, then we say that i is a fully absorbing state. Finally, if pii = 0, then we say that i\nis a transient state. Note that if pii \u2208 {0, 1} for every state i \u2208 N , then the above process reduces to\na standard Markov chain [9].\n\nA PARW is a second-order Markov chain completely speci\ufb01ed by its \ufb01rst order transition probabil-\nities {pij}. One can observe that any PARW can be realized as a standard Markov chain by adding\na sink (fully absorbing state) to each vertex in the graph, as illustrated in Fig. 1(c). The transition\n\n2\n\n\fprobability from i to its sink i\u2032 equals the absorption rate pii in PARWs. One may also notice that\nthe construction of PARWs can be generalized to the m-th order, i.e., the process is absorbed at a\nstate only after it has stayed at that state for m-consecutive steps. However, it can be shown that any\nm-th order PARW can be realized by a second-order PARW. 
We will not elaborate on this due to space constraints.\n\n2.1 PARWs on Graphs\n\nLet G = (V, W) be an undirected weighted graph, where V is a set of n vertices and W = [wij] \u2208 Rn\u00d7n is a symmetric non-negative matrix of pairwise affinities among vertices. We assume G is connected. Let D = diag(d1, d2, . . . , dn) with di = \u2211j wij as the degree of vertex i, and define the Laplacian of G by L = D \u2212 W [6]. Denote by d(S) := \u2211i\u2208S di the volume of a subset S \u2286 V of vertices. Let \u03bb1, \u03bb2, . . . , \u03bbn \u2265 0 be arbitrary, and set \u039b = diag(\u03bb1, \u03bb2, . . . , \u03bbn). Suppose that we define the first order transition probabilities of a PARW by\n\npij = \u03bbi/(\u03bbi + di) if i = j, and pij = wij/(\u03bbi + di) if i \u2260 j.    (2)\n\nThen, we see that state i is an absorbing state (either partially or fully) when \u03bbi > 0, and is a transient state when \u03bbi = 0. In particular, the matrix \u039b acts like a regularizer that controls the absorption rate of each state, i.e., the larger \u03bbi, the larger pii. In the sequel, we refer to \u039b as the regularizer matrix.\n\nAbsorption Probabilities. We are interested in the probability aij that a random walk starting from state i is absorbed at state j in any finite number of steps. Let A = [aij] \u2208 Rn\u00d7n be the matrix of absorption probabilities. The following theorem shows that A has a closed form.\n\nTheorem 2.1. Suppose \u03bbi > 0 for some i. Then A = (\u039b + L)\u22121\u039b.\n\nProof. Since \u03bbi > 0 for some i, the matrix \u039b + L is positive definite and hence non-singular. Moreover, the matrix \u039b + D is non-singular, since D is non-singular. Thus, the matrix I \u2212 (\u039b + D)\u22121W = (\u039b + D)\u22121(\u039b + L) is also non-singular. 
Now, observe that the absorption probabilities {aij} satisfy the following equations:\n\naii = \u03bbi/(\u03bbi + di) \u00d7 1 + \u2211j\u2260i (wij/(\u03bbi + di)) aji,    (3)\n\naij = \u2211k\u2260i (wik/(\u03bbi + di)) akj, i \u2260 j.    (4)\n\nUpon writing equations (3) and (4) in matrix form, we have (I \u2212 (\u039b + D)\u22121W)A = (\u039b + D)\u22121\u039b, whence A = (I \u2212 (\u039b + D)\u22121W)\u22121(\u039b + D)\u22121\u039b = (\u039b + D \u2212 W)\u22121\u039b = (\u039b + L)\u22121\u039b.\n\nThe following result confirms that A is indeed a probability matrix.\n\nProposition 2.1. Suppose \u03bbi > 0 for some i. Then A is a non-negative matrix with each row summing up to 1.\n\nBy Proposition 2.1, \u2211k ajk = 1 for any j. This means that a PARW starting from any vertex will eventually be absorbed, provided that there is at least one absorbing state in the state space.\n\n2.2 Limits of Absorption Probabilities\n\nBy Theorem 2.1, we see that the absorption probabilities (A) are governed by both the structure of the graph (L) and the regularizer matrix (\u039b). It would be interesting to see how A varies with \u039b, particularly when the \u03bbi\u2019s become small, which allows the flow to propagate sufficiently (Fig. 1(a)). The following result shows that as \u039b (the \u03bbi\u2019s) vanishes, each row of A converges to a distribution proportional to (\u03bb1, \u03bb2, . . . , \u03bbn), regardless of graph structure.\n\nTheorem 2.2. Suppose \u03bbi > 0 for all i. Then\n\nlim\u03b1\u21920+ (\u03b1\u039b + L)\u22121\u03b1\u039b = 1\u00af\u03bb\u22a4,    (5)\n\nwhere (\u00af\u03bb)i = \u03bbi/\u2211j \u03bbj. In particular, lim\u03b1\u21920+ (\u03b1I + L)\u22121\u03b1I = (1/n)11\u22a4.\n\nTheorem 2.2 tells us that with \u039b = \u03b1I and as \u03b1 \u2192 0 a PARW will converge to the constant distribution 1/n, regardless of the starting vertex. 
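The closed form in Theorem 2.1 and the limit in Theorem 2.2 are easy to check numerically; below is a small sketch, where the dense random graph is an illustrative assumption rather than data from the paper:

```python
import numpy as np

# An arbitrary small connected weighted graph (illustrative only).
rng = np.random.default_rng(1)
n = 8
W = np.triu(rng.random((n, n)), 1)
W = W + W.T                              # symmetric affinities, zero diagonal
L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W

def absorption_matrix(lam):
    """Theorem 2.1: A = (Lambda + L)^{-1} Lambda."""
    return np.linalg.solve(np.diag(lam) + L, np.diag(lam))

A = absorption_matrix(np.full(n, 0.3))
print(A.min() >= 0.0, np.allclose(A.sum(axis=1), 1.0))   # Proposition 2.1

# Theorem 2.2 with Lambda = alpha*I: A approaches (1/n) 11^T as alpha -> 0+.
for alpha in (1e-1, 1e-3, 1e-5):
    dev = np.abs(absorption_matrix(np.full(n, alpha)) - 1.0 / n).max()
    print(alpha, dev)                    # deviation shrinks with alpha
```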
At first glance, this limit seems meaningless. However, the following result will show that it actually has interesting connections with L+, the pseudo-inverse of the graph Laplacian, a matrix that is widely studied and proven useful for many learning tasks including recommendation and clustering [8].\n\nProposition 2.2. Suppose \u039b = \u03b1I and denote A\u03b1 := (\u039b + L)\u22121\u039b = (\u03b1I + L)\u22121\u03b1. Then,\n\nlim\u03b1\u21920 (A\u03b1 \u2212 (1/n)11\u22a4)/\u03b1 = L+.    (6)\n\nProposition 2.2 gives a novel probabilistic interpretation of L+. Note that by Theorem 2.2, A0 := lim\u03b1\u21920 A\u03b1 = (1/n)11\u22a4. Thus L+ is the derivative of A\u03b1 w.r.t. \u03b1 at \u03b1 = 0, implying that L+ reflects the variation of absorption probabilities when the absorption rate is very small. By (6), we see that ranking by L+ is essentially the same as ranking by A\u03b1, when \u03b1 is sufficiently small.\n\n2.3 Relations with Popular Ranking and Classification Models\n\nRelations with PageRank Vectors. Suppose \u03bbj > 0 for all j. Let a be the absorption probability vector of a PARW starting from vertex i. Denote by s the indicator vector of i, i.e., s(i) = 1 and s(j) = 0 for j \u2260 i. Then a\u22a4 = s\u22a4(\u039b + L)\u22121\u039b, which can be rewritten as\n\na\u22a4 = s\u22a4(\u039b + D)\u22121\u039b + a\u22a4\u039b\u22121W(\u039b + D)\u22121\u039b.    (7)\n\nBy letting \u039b = (\u03b2/(1 \u2212 \u03b2))D, we have a\u22a4 = \u03b2s\u22a4 + (1 \u2212 \u03b2)a\u22a4D\u22121W, which is exactly the equilibrium equation for personalized PageRank [14]. Note that \u03b2 is often referred to as the \u201cteleportation\u201d probability in PageRank. This shows that personalized PageRank is a special case of PARWs with absorption rates pii = \u03bbi/(\u03bbi + di) = \u03b2.\n\nRelations with Hitting and Commute Times. The hitting time Hij is the expected time that it takes a random walk starting from i to first arrive at j, and the commute time Cij is the expected time it takes a random walk starting from i to travel to j and back to i, which can be computed as\n\nHij = d(G)(L+jj \u2212 L+ij),    Cij = Hij + Hji = d(G)(L+ii + L+jj \u2212 2L+ij),    (8)\n\nwhere d(G) := \u2211i di denotes the volume of the graph. By (6), when \u039b = \u03b1I and \u03b1 is sufficiently small, ranking with Hij or Cij (say, with respect to i) is the same as ranking by A\u03b1jj \u2212 A\u03b1ij or A\u03b1ii + A\u03b1jj \u2212 2A\u03b1ij, respectively. This appears to be not particularly meaningful because the term A\u03b1jj is the self-absorption probability that does not contain any essential information about the starting vertex i. Accordingly, it should not be included as part of the ranking function with respect to i. This argument is also supported in a recent study by [16], where the hitting and commute times are shown to be dominated by the inverse of degrees of vertices. In other words, they do not take into account the graph structure at all. A remedy they propose is to throw away the diagonal terms of L+ and only use the off-diagonal terms. This happens to suggest using absorption probabilities for ranking and as a similarity measure, because when \u03b1 is sufficiently small, ranking by the off-diagonal terms of L+ is essentially the same as ranking by A\u03b1ij, i.e., the absorption probability of starting from i and being absorbed at j. Our theoretical analysis in Section 3 and the simulation results in Section 4 further confirm this argument.\n\nRelations with Semi-supervised Learning. Interestingly, many label propagation algorithms in semi-supervised learning can be cast in PARWs. 
The harmonic function method [20] is a PARW\nwhen setting \u03bbi = \u221e (absorption rate 1) for the labeled vertices while \u03bbi = 0 (absorption rate 0) for\nthe unlabeled. In [19] the authors have made this interpretation in terms of absorbing random walks,\nwhere a random walk arriving at an absorbing state will stay there forever. PARWs can be viewed\nas an extension of absorbing random walks. The regularized harmonic function method [5] is also a\nPARW when setting \u03bbi = \u03b1 for the labeled vertices while \u03bbi = 0 for the unlabeled. The consistency\nmethod [17], if using un-normalized Laplacian instead of normalized Laplacian, is a PARW with\n\u039b = \u03b1I. Our analysis in this paper reveals several nice properties of this case (Section 3). A variant\nof this method is a PARW with \u039b = \u03b1D, which is the same as PageRank as shown above. If we\nadd an additional sink to the graph, a variant of harmonic function method [10] and a variant of the\nregularized harmonic function method [3] can all be included as instances of PARWs. We omit the\ndetails here due to space constraints.\n\n4\n\n\fBene\ufb01ts of a Unifying View. We have shown that PARWs can unify or relate many models from\ndifferent contexts. This brings at least two bene\ufb01ts. First, it sheds some light on existing models. For\ninstance, hitting and commute times are not suitable for ranking given its interpretation in absorption\nprobabilities, as discussed above. In the next section, we will show that a special case of PARWs is\nbetter suited for implementing the cluster assumption for graph-based learning. Second, a unifying\nview builds bridges between different paradigms thus making it easier to transfer \ufb01ndings between\nthem. For example, it has been shown in [2, 4] that approximate personalized PageRank vectors can\nbe computed in O(1/\u01eb) iterations, where \u01eb is a precision tolerance parameter. 
We indicate here that such a technique is also applicable to PARWs due to their generalizing nature. Consequently, most models included in PARWs can be substantially accelerated using the same technique.\n\n3 PARWs with Graph Conductance\n\nIn this section, we present results on the properties of the absorption probability vector ai obtained by a PARW starting from vertex i (i.e., ai\u22a4 is row i of A). We show that properties of ai relate closely to the connectivity between i and the rest of the graph, which can be captured by the conductance of the cluster S where i belongs. We also find that properties of ai depend on the setting of absorption rates. Our key results can be summarized as follows. In general, the probability mass of ai is mostly absorbed by S. Under proper absorption rates, ai can vary slowly within S while dropping sharply outside S. Such properties are highly desirable for learning tasks such as ranking, clustering, and classification.\n\nThe conductance of a subset S \u2282 V of vertices is defined as \u03a6(S) = w(S, \u00afS)/min(d(S), d(\u00afS)), where w(S, \u00afS) := \u2211(i,j)\u2208e(S,\u00afS) wij is the cut between S and its complement \u00afS [6]. We denote the indicator vector of S by \u03c7S such that \u03c7S(i) = 1 if i \u2208 S and \u03c7S(i) = 0 otherwise; and denote the stationary distribution w.r.t. S by \u03c0S such that \u03c0S(i) = di/d(S) if i \u2208 S and \u03c0S(i) = 0 otherwise. In terms of the conductance of S, the following theorem gives an upper bound on the expected probability mass escaped from S if the distribution of the starting vertex is \u03c0S.\n\nTheorem 3.1. Let S be any set of vertices satisfying d(S) \u2264 (1/2)d(G). Let \u03b31 = mini\u2208S \u03bbi/di and \u03b32 = maxi\u2208\u00afS \u03bbi/di. Then,\n\n\u03c0S\u22a4 A \u03c7\u00afS \u2264 (\u03b32(1 + \u03b31)/((1 + \u03b32)\u03b31^2)) \u03a6(S).    (9)\n\nTheorem 3.1 shows that most of the probability mass will be absorbed in S, provided that S is of small conductance and the random walk starts from S according to \u03c0S. In other words, a PARW will be trapped inside the cluster1 from where it starts, as desired. To identify the entire cluster, it would be more desirable that the absorption probabilities vary slowly within the cluster while dropping sharply outside. As such, the cluster can be identified by detecting the sharp drop. We show below that such a property can be achieved by setting appropriate absorption rates at vertices.\n\n3.1 PARWs with \u039b = \u03b1I\n\nWe will prove that the choice of \u039b = \u03b1I can fulfill the above goal. Before presenting the theoretical analysis, let us discuss the intuition behind it from both flow (Fig. 1(a)) and random walk perspectives. To vary slowly within the cluster, the flow needs to be distributed evenly within it; while to drop sharply outside, the flow must be prevented from escaping. This means that the absorption rates should be small in the interior but large near the boundary area of the cluster. Setting \u039b = \u03b1I achieves this. It corresponds to the absorption rates pii = \u03bbi/(\u03bbi + di) = \u03b1/(\u03b1 + di), which decrease monotonically with di. Since the degrees of vertices are usually relatively large in the interior of the cluster due to denser connections, and small near its boundary area (Fig. 2(a)), the absorption rates are therefore much larger at its boundary than in its interior (Fig. 2(b)). Stated differently, a random walk may move freely inside the cluster, but it will get absorbed with high probability when traveling near the cluster\u2019s boundary. 
In this way, the absorption rates set up a bounding \u201cwall\u201d around the cluster to prevent the random walk from escaping, leading to an absorption probability vector that varies slowly within the cluster while dropping sharply outside (Figs. 2(c\u2013d)), thus implementing the cluster assumption. We make these arguments precise below.\n\n1A cluster is understood as a subset of vertices of small conductance.\n\nFigure 2: Absorption rates and absorption probabilities. (a) A data set of three Gaussians with the degrees of vertices in the underlying graph shown (see Section 4 for the descriptions of the data and graph construction). A starting vertex is denoted by a black circle. (b\u2013c) Absorption rates and absorption probabilities for \u039b = \u03b1I (\u03b1 = 10\u22123). (d) Sorted absorption probabilities of (c). For illustration purposes, in (a\u2013b), the degrees of vertices and the absorption rates have been properly scaled, and in (c), the data are arranged such that points within each Gaussian appear consecutively.\n\nIt is worth pointing out that a PARW with \u039b = \u03b1I is symmetric, i.e., the absorption probability of starting from i and being absorbed at j is equal to the probability of starting from j and being absorbed at i. For simplicity, we use the abbreviated notation a to denote ai, the absorption probability vector for the PARW starting from vertex i. By (3) and the symmetry property, we immediately see that a has the following \u201charmonic\u201d property:\n\na(i) = \u03bbi/(\u03bbi + di) + \u2211k\u2260i (wik/(\u03bbi + di)) a(k),    a(j) = \u2211k\u2260j (wjk/(\u03bbj + dj)) a(k), j \u2260 i.    (10)\n\nWe will use this property to prove some interesting results. 
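A numerical sketch of this behavior: the three-block graph below is a synthetic stand-in for the three-Gaussians example (block sizes, edge probabilities, bridge weights, and the value of alpha are all illustrative assumptions, not the paper's settings):

```python
import numpy as np

# Three dense blocks of 30 vertices joined by two weak bridges.
rng = np.random.default_rng(2)
n = 90
blocks = np.repeat([0, 1, 2], 30)
W = np.zeros((n, n))
for u in range(n):
    for v in range(u + 1, n):
        if blocks[u] == blocks[v] and rng.random() < 0.5:
            W[u, v] = W[v, u] = 1.0        # dense connections inside a block
W[29, 30] = W[30, 29] = 0.1                # weak bridges: low-conductance cuts
W[59, 60] = W[60, 59] = 0.1
L = np.diag(W.sum(axis=1)) - W

alpha = 0.3                                # Lambda = alpha * I
A = np.linalg.solve(alpha * np.eye(n) + L, alpha * np.eye(n))
assert np.allclose(A, A.T)                 # the Lambda = alpha*I PARW is symmetric

a = A[0]                                   # absorption vector for a walk from vertex 0
print("mass absorbed in start block:", a[:30].sum())
print("mean level inside vs. outside:", a[:30].mean(), a[30:].mean())
```

With these settings most of the probability mass is absorbed inside the start block, and the per-vertex level inside sits far above the level outside, i.e., the vector is roughly flat inside the cluster and drops sharply across the cut.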
Another desirable property one should notice for this PARW is that the starting vertex always has the largest absorption probability, as shown by the following lemma.\n\nLemma 3.2. Given \u039b = \u03b1I, aii > aij for any i \u2260 j.\n\nBy Lemma 3.2 and without loss of generality, we assume the vertices are sorted so that a(1) > a(2) \u2265 \u00b7 \u00b7 \u00b7 \u2265 a(n), where vertex 1 is the starting vertex. Let Sk be the set of vertices {1, . . . , k}. Denote by e(Si, Sj) the set of edges between Si and Sj.\n\nThe following theorem quantifies the drop of the absorption probabilities between Sk and \u00afSk.\n\nTheorem 3.3. For every S \u2208 {Sk | k = 1, 2, . . . , n},\n\n\u2211(u,v)\u2208e(S,\u00afS) wuv (a(u) \u2212 a(v)) = \u03b1 (1 \u2212 \u2211k\u2208S a(k)).    (11)\n\nTheorem 3.3 shows that the weighted difference in absorption probabilities between Sk and \u00afSk is \u03b1(1 \u2212 \u2211j\u2264k a(j)), implying that it drops slowly when \u03b1 is small and as k increases, as expected.\n\nNext we show the variation of absorption probabilities with graph conductance. Without loss of generality, we consider sets Sj where d(Sj) \u2264 (1/2)d(G). The following lemma says that a(j + 1) will drop little from a(j) if the set Sj has high conductance or if the vertex j is far away from the starting vertex 1 (i.e., j \u226b 1).\n\nLemma 3.4. If \u03a6(Sj) = \u03c6, then\n\na(j + 1) \u2265 a(j) \u2212 \u03b1(1 \u2212 \u2211l\u2264j a(l))/(\u03c6 d(Sj)).    (12)\n\nThe above result can be extended to describe the drop in a much longer range, as stated in the following theorem.\n\nFigure 3: Absorption probabilities on the three Gaussians in Fig. 2(a) with the starting vertex denoted by a black circle. (a\u2013e) \u039b = \u03b1I, \u03b1 = 1, 10\u22122, 10\u22124, 10\u22126, 10\u22128; (f\u2013j) \u039b = \u03b1D, \u03b1 = 1, 10\u22122, 10\u22124, 10\u22126, 10\u22128. For illustration purposes, the data are arranged such that points within each Gaussian appear consecutively, as in Fig. 2(c).\n\nTable 1: Ranking results (MAP) on USPS\n\nDigits               0     1     2     3     4     5     6     7     8     9     All\n\u039b = \u03b1I           .981  .988  .876  .893  .646  .778  .940  .919  .746  .730  .850\nPageRank             .886  .972  .608  .764  .488  .568  .837  .825  .626  .702  .728\nManifold Ranking     .957  .987  .827  .827  .467  .630  .917  .822  .675  .719  .783\nEuclidean Distance   .640  .980  .318  .499  .337  .294  .548  .620  .368  .480  .508\n\nTheorem 3.5. 
If \u03a6(Sj) \u2265 2\u03c6, then there exists a k > j such that\n\nd(Sk) \u2265 (1 + \u03c6)d(Sj)    and    a(k) \u2265 a(j) \u2212 \u03b1(1 \u2212 \u2211l\u2264j a(l))/(\u03c6 d(Sj)).\n\nTheorem 3.5 tells us that if the set Sj has high conductance, then there will be a set Sk much larger than Sj where the absorption probability a(k) remains large. In other words, a(k) will not drop much if Sj is closely connected with the rest of the graph. Combining Theorems 3.3, 3.5, and 3.1, we see that the absorption probability vector of the PARW with \u039b = \u03b1I has the nice property of varying slowly within the cluster while dropping sharply outside.\n\nWe remark that similar analyses have been conducted in [1, 2] on personalized PageRank, for the local clustering problem [15] whose goal is to find a local cut of low conductance near a specified starting vertex. As shown in Section 2, personalized PageRank is a special case of PARWs with \u039b = \u03b1D = (\u03b2/(1 \u2212 \u03b2))D, which corresponds to setting the same absorption rate pii = \u03b2 at each vertex. This setting does not take advantage of the cluster assumption. Indeed, despite the significant cluster structure in the three Gaussians (Fig. 2), no clear drop emerges by varying \u03b2 (Section 4). This explains the \u201cheuristic\u201d used in [1, 2] where the personalized PageRank vector is divided by the degrees of vertices to generate a sharp drop. In contrast, our choice of \u039b = \u03b1I appears to be more justified, without the need of such post-processing while retaining a probabilistic foundation.\n\n4 Simulation\n\nIn this section, we demonstrate our theoretical results on both synthetic and real data. For each data set, a weighted k-NN graph is constructed with k = 20. 
The similarity between vertices i and j is computed as wij = exp(\u2212d\u00b2ij/\u03c3) if i is within j\u2019s k nearest neighbors or vice versa, and wij = 0 otherwise (wii = 0), where \u03c3 = 0.2 \u00d7 r and r denotes the average squared distance from each point to its 20th nearest neighbor.\n\nThe first experiment is to examine the absorption probabilities when varying absorption rates. We use the synthetic three Gaussians in Fig. 2(a), which consists of 900 points from three Gaussians, with 300 in each. Fig. 3 compares the cases of \u039b = \u03b1I and \u039b = \u03b1D (PageRank). We can draw several observations. For \u039b = \u03b1I, when \u03b1 is large, most probability mass is absorbed in the cluster of the starting vertex (Fig. 3(a)). As it becomes appropriately small, the probability mass distributes evenly within the cluster, and a sharp drop emerges (Fig. 3(b)). As \u03b1 \u2192 0, the probability mass distributes more evenly within each cluster and also on the entire graph (Figs. 3(c\u2013e)), but the drops between clusters are still quite significant. In contrast, for \u039b = \u03b1D, no significant drops show for any \u03b1 (Figs. 3(f\u2013j)). This is due to the uniform absorption rates on the graph, which make the flow favor vertices with denser connections (i.e., of large degrees). These observations support the theoretical arguments in Section 3 for PARWs with \u039b = \u03b1I and suggest its robustness in distinguishing between different clusters.\n\nTable 2: Classification accuracy on USPS\n\nHMN            LGC            \u039b = \u03b1D        \u039b = \u03b1I\n.782 \u00b1 .068    .792 \u00b1 .062    .787 \u00b1 .048    .881 \u00b1 .039\n\nThe second experiment is to test the potential of PARWs for information retrieval. We compare PARWs with \u039b = \u03b1I to PageRank (i.e., PARWs with \u039b = \u03b1D), Manifold Ranking [18], and the baseline using Euclidean distance. 
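The graph construction used throughout this section can be sketched directly from the description above (a brute-force implementation of the stated rule; the random 2-D point set is a placeholder for the actual data):

```python
import numpy as np

def knn_graph(X, k=20):
    """Weighted k-NN graph as described above: w_ij = exp(-d_ij^2 / sigma)
    if i is among j's k nearest neighbors or vice versa, with
    sigma = 0.2 * r, where r is the average squared distance from each
    point to its k-th nearest neighbor."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    order = np.argsort(d2, axis=1)           # column 0 is the point itself
    r = d2[np.arange(n), order[:, k]].mean()
    sigma = 0.2 * r
    W = np.zeros((n, n))
    for i in range(n):
        for j in order[i, 1:k + 1]:
            w = np.exp(-d2[i, j] / sigma)
            W[i, j] = W[j, i] = w            # symmetric: i in j's k-NN or vice versa
    np.fill_diagonal(W, 0.0)
    return W

X = np.random.default_rng(3).normal(size=(200, 2))
W = knn_graph(X)
print(W.shape, np.allclose(W, W.T))
```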
For parameter selection, we use \u03b1 = 10\u22126 for \u039b = \u03b1I\nand \u03b2 = 0.15 for PageRank (see Section 2.3) as suggested in [14]. The regularization parameter\nin Manifold Ranking is set to 0.99, following [18]. The image benchmark USPS2 is used for this\nexperiment, which contains 9298 images of handwritten digits from 0 to 9 of size 16 \u00d7 16, with\n1553, 1269, 929, 824, 852, 716, 834, 792, 708, and 821 instances of each digit respectively. Each\ninstance is used as a query and the mean average precision (MAP) is reported. The results are shown\nin Table 1. We see that the PARW with \u039b = \u03b1I consistently gives best results for individual digits\nas well as the entire data set.\n\nIn the last experiment, we test PARWs on classi\ufb01cation/semi-supervised learning, also on USPS\nwith all 9298 images. We randomly sample 20 instances as labeled data and make sure there is\nat least one label for each class. For PARWs, we classify each unlabeled instance u to the class\nof the labeled vertex v where u is most likely to be absorbed, i.e., v = arg maxi\u2208L aui where L\ndenotes the labeled data and aui is the absorption probability. We compare PARWs with \u039b = \u03b1I\n(\u03b1 = 10\u22126) and \u039b = \u03b1D (\u03b2 = 0.15) to the harmonic function method (HMN) [20] coupled\nwith class mass normalization (CMN) and the local and global consistency (LGC) method [17]. No\nparameter in HMN is required, and the regularization parameter in LGC is set to 0.99 following [17].\nThe classi\ufb01cation accuracy averaged over 1000 runs is shown in Table 2. Again, it con\ufb01rms the\nsuperior performance of the PARW with \u039b = \u03b1I.\nIn the second and third experiments, we also tried other parameter settings for methods where ap-\npropriate. 
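The classification rule used above, assigning each vertex to the class of the labeled vertex at which it is most likely to be absorbed, can be sketched as follows; the two-block toy graph, the label positions, and the value of alpha are illustrative assumptions:

```python
import numpy as np

def parw_classify(W, labeled_idx, labels, alpha=0.1):
    """Classify vertex u to the class of the labeled vertex v maximizing
    the absorption probability a_uv, using Lambda = alpha * I."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    A = np.linalg.solve(alpha * np.eye(n) + L, alpha * np.eye(n))
    return labels[np.argmax(A[:, labeled_idx], axis=1)]

# Toy data: two dense blocks joined by one weak edge, one label per block.
rng = np.random.default_rng(4)
n = 60
W = np.zeros((n, n))
for u in range(n):
    for v in range(u + 1, n):
        if (u < 30) == (v < 30) and rng.random() < 0.5:
            W[u, v] = W[v, u] = 1.0
W[29, 30] = W[30, 29] = 0.1

pred = parw_classify(W, np.array([0, 45]), np.array([0, 1]))
print((pred[:30] == 0).mean(), (pred[30:] == 1).mean())   # per-block accuracy
```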
We found that the performance of PARWs with Λ = αI is quite stable for small α, and that varying the parameters of the other methods did not lead to significantly better results, which validates our previous arguments.

5 Conclusions

We have presented partially absorbing random walks (PARWs), a novel stochastic process generalizing ordinary random walks. Surprisingly, it has been shown to unify or relate many popular existing models and to provide new insights into them. Moreover, a new algorithm developed from PARWs has been theoretically shown to reveal cluster structure under the cluster assumption. Simulation results confirm our theoretical analysis and suggest its potential for a variety of learning tasks, including retrieval, clustering, and classification. In future work, we plan to apply our model to real applications.

Acknowledgements

This work is supported in part by Office of Naval Research (ONR) grant #N00014-10-1-0242. The authors would like to thank the anonymous reviewers for their insightful comments.

² http://www-stat.stanford.edu/~tibs/ElemStatLearn/

References

[1] R. Andersen and F. Chung. Detecting sharp drops in PageRank and a simplified local partitioning algorithm. Theory and Applications of Models of Computation, pages 1–12, 2007.
[2] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS, pages 475–486, 2006.
[3] Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. Semi-Supervised Learning, pages 193–216, 2006.
[4] P. Berkhin. Bookmark-coloring algorithm for personalized PageRank computing. Internet Mathematics, 3(1):41–62, 2006.
[5] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In AISTATS, 2005.
[6] F. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[7] R. Coifman and S. Lafon.
Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
[8] F. Fouss, A. Pirotte, J. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007.
[9] J. Kemeny and J. Snell. Finite Markov Chains. Springer, 1976.
[10] B. Kveton, M. Valko, A. Rahimi, and L. Huang. Semi-supervised learning with max-margin graph cuts. In AISTATS, pages 421–428, 2010.
[11] L. Lovász and M. Simonovits. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In FOCS, pages 346–354, 1990.
[12] M. Meila and J. Shi. A random walks view of spectral segmentation. In AISTATS, 2001.
[13] B. Nadler, N. Srebro, and X. Zhou. Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. In NIPS, pages 1330–1338, 2009.
[14] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[15] D. A. Spielman and S.-H. Teng. A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning. CoRR, abs/0809.3232, 2008.
[16] U. von Luxburg, A. Radl, and M. Hein. Hitting and commute times in large graphs are often misleading. arXiv preprint arXiv:1003.1266, 2010.
[17] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, pages 595–602, 2004.
[18] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf. Ranking on data manifolds. In NIPS, 2004.
[19] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
[20] X. Zhu, Z. Ghahramani, and J. Lafferty.
Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.