Fast Bidirectional Probability Estimation in Markov Models

Advances in Neural Information Processing Systems, pages 1423-1431.

Siddhartha Banerjee* (sbanerjee@cornell.edu)    Peter Lofgren† (plofgren@cs.stanford.edu)

Abstract

We develop a new bidirectional algorithm for estimating Markov chain multi-step transition probabilities: given a Markov chain, we want to estimate the probability of hitting a given target state in $\ell$ steps after starting from a given source distribution. Given the target state $t$, we use a (reverse) local power iteration to construct an 'expanded target distribution', which has the same mean as the quantity we want to estimate, but a smaller variance; this can then be sampled efficiently by a Monte Carlo algorithm. Our method extends to any Markov chain on a discrete (finite or countable) state-space, and can be extended to compute functions of multi-step transition probabilities such as PageRank, graph diffusions, hitting/return times, etc. Our main result is that in 'sparse' Markov chains, wherein the number of transitions between states is comparable to the number of states, the running time of our algorithm for a uniform-random target node is order-wise smaller than that of Monte Carlo and power-iteration-based algorithms; in particular, our method can estimate a probability $p$ using only $O(1/\sqrt{p})$ running time.

1 Introduction

Markov chains are one of the workhorses of stochastic modeling, finding use across a variety of applications: MCMC algorithms for simulation and statistical inference; network centrality metrics for data mining; statistical physics; operations management models for reliability, inventory and supply chains; etc. In this paper, we consider a fundamental problem associated with Markov chains, which we refer to as the multi-step transition probability estimation (or MSTP-estimation) problem: given a Markov chain on state space $S$ with transition matrix $P$, an initial source distribution $\sigma$ over $S$, a target state $t \in S$ and a fixed length $\ell$, we are interested in computing the $\ell$-step transition probability from $\sigma$ to $t$. Formally, we want to estimate:
$$p^\ell_\sigma[t] := \langle \sigma P^\ell, e_t \rangle = \sigma P^\ell e_t^T, \qquad (1)$$
where $e_t$ is the indicator vector of state $t$.
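When the chain is small enough to store explicitly, Equation (1) can be evaluated exactly by $\ell$ successive vector-matrix products. The following minimal sketch (our illustration, not from the paper; the 3-state chain and function name are ours) makes the quantity concrete.

```python
import numpy as np

def mstp_exact(P, sigma, t, ell):
    """Compute p^ell_sigma[t] = <sigma P^ell, e_t> by ell successive
    vector-matrix products (no explicit matrix power is formed)."""
    x = np.asarray(sigma, dtype=float)
    for _ in range(ell):
        x = x @ P        # after k iterations, x == sigma P^k
    return x[t]

# A small 3-state lazy-cycle chain.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
sigma = np.array([1.0, 0.0, 0.0])
p = mstp_exact(P, sigma, t=2, ell=2)
# the only contributing path is 0 -> 1 -> 2, so p == 0.5 * 0.5 == 0.25
```

This direct computation costs one pass over the matrix per step, which is exactly what becomes infeasible at scale and motivates the estimation problem.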
A natural parametrization for the complexity of MSTP-estimation is in terms of the minimum transition probability we want to detect: given a desired minimum detection threshold $\delta$, we want algorithms that give estimates with small relative error for any $(\sigma, t, \ell)$ such that $p^\ell_\sigma[t] > \delta$.
Parametrizing in terms of the minimum detection threshold $\delta$ can be thought of as benchmarking against a standard Monte Carlo algorithm, which estimates $p^\ell_\sigma[t]$ by sampling independent $\ell$-step paths starting from states sampled from $\sigma$. An alternate technique for MSTP-estimation is based on linear-algebraic iterations, in particular the (local) power iteration. We discuss these in more detail in Section 1.2. Crucially, however, both these techniques have a running time of $\Omega(1/\delta)$ for testing if $p^\ell_\sigma[t] > \delta$ (cf. Section 1.2).

*Siddhartha Banerjee is an assistant professor at the School of Operations Research and Information Engineering at Cornell (http://people.orie.cornell.edu/sbanerjee).
†Peter Lofgren is a graduate student in the Computer Science Department at Stanford (http://cs.stanford.edu/people/plofgren/).

1.1 Our Results

To the best of our knowledge, our work gives the first bidirectional algorithm for MSTP-estimation which works for general discrete state-space Markov chains¹. The algorithm we develop is very simple, both in terms of implementation and analysis. Moreover, we prove that in many settings, it is order-wise faster than existing techniques.
Our algorithm consists of two distinct forward and reverse components, which are executed sequentially.
In brief, the two components proceed as follows:
• Reverse-work: Starting from the target node $t$, we perform a sequence of reverse local power iterations; in particular, we use the REVERSE-PUSH operation defined in Algorithm 1.
• Forward-work: We next sample a number of random walks of length $\ell$, starting from $\sigma$ and transitioning according to $P$, and return the sum of residues along each walk as an estimate of $p^\ell_\sigma[t]$.
This full algorithm, which we refer to as the Bidirectional-MSTP estimator, is formalized in Algorithm 2. It works for all countable-state Markov chains, giving the following accuracy result:

Theorem 1 (For details, refer to Section 2.3). Given any Markov chain $P$, source distribution $\sigma$, terminal state $t$, length $\ell$, threshold $\delta$ and relative error $\epsilon$, Bidirectional-MSTP (Algorithm 2) returns an unbiased estimate $\hat{p}^\ell_\sigma[t]$ for $p^\ell_\sigma[t]$ which, with high probability, satisfies:
$$\big|\hat{p}^\ell_\sigma[t] - p^\ell_\sigma[t]\big| < \max\big\{\epsilon\, p^\ell_\sigma[t],\ \delta\big\}.$$

Since we dynamically adjust the number of REVERSE-PUSH operations to ensure that all residues are small, the proof of the above theorem follows from straightforward concentration bounds.
Since Bidirectional-MSTP combines local power iteration and Monte Carlo techniques, a natural question is when the algorithm is faster than both. It is easy to construct scenarios where the runtime of Bidirectional-MSTP is comparable to that of its two constituent algorithms; for example, if $t$ has more than $1/\delta$ in-neighbors. Surprisingly, however, we show that in sparse Markov chains and for typical target states, Bidirectional-MSTP is order-wise faster:

Theorem 2 (For details, refer to Section 2.3). Given any Markov chain $P$, source distribution $\sigma$, length $\ell$, threshold $\delta$ and desired accuracy $\epsilon$; then for a uniform random choice of $t \in S$, the Bidirectional-MSTP algorithm has a running time of $\tilde{O}(\ell^{3/2}\sqrt{d/\delta})$, where $d$ is the average number of neighbors of nodes in $S$.

Thus, for typical targets, we can estimate transition probabilities of order $\delta$ in time only $O(1/\sqrt{\delta})$. Note that we do not need every state to have a small number of neighboring states, but rather that the number is small on average; for example, this is true in 'power-law' networks, where some nodes have very high degree, but the average degree is small. The proof of this result is based on a modification of an argument in [2]; refer to Section 2.3 for details.
Estimating transition probabilities to a target state is one of the fundamental primitives in Markov chain models; hence, we believe that our algorithm can prove useful in a variety of application domains. In Section 3, we briefly describe how to adapt our method for some of these applications: estimating hitting/return times and stationary probabilities, extensions to non-homogeneous Markov chains (in particular, for estimating graph diffusions and heat kernels), and connections to local algorithms and expansion testing. In addition, our MSTP-estimator could be useful in several other applications: estimating ruin probabilities in reliability models, buffer overflows in queueing systems, statistical physics simulations, etc.

1.2 Existing Approaches for MSTP-Estimation

There are two main techniques used for MSTP-estimation. The first is a natural Monte Carlo algorithm: we estimate $p^\ell_\sigma[t]$ by sampling independent $\ell$-step paths, each starting from a random state sampled from $\sigma$.
A simple concentration argument shows that for a given value of $\delta$, we need $\tilde{\Theta}(1/\delta)$ samples to get an accurate estimate of $p^\ell_\sigma[t]$, irrespective of the choice of $t$ and the structure of $P$. Note that this algorithm is agnostic of the terminal state $t$; it gives an accurate estimate for any $t$ such that $p^\ell_\sigma[t] > \delta$.
On the other hand, the problem also admits a natural linear-algebraic solution, using the standard power iteration starting with $\sigma$, or the reverse power iteration starting with $e_t$ (which is obtained by re-writing Equation (1) as $p^\ell_\sigma[t] := \sigma (e_t (P^T)^\ell)^T$). When the state space is large, performing a direct power iteration is infeasible; however, there are localized versions of the power iteration that are still efficient. Such algorithms have been developed, among other applications, for PageRank estimation [3, 4] and for heat kernel estimation [5]. Although slow in the worst case², such local update algorithms are often fast in practice, as unlike Monte Carlo methods they exploit the local structure of the chain. However, even in sparse Markov chains and for a large fraction of target states, their running time can be $\Omega(1/\delta)$. For example, consider a random walk on a random $d$-regular graph and let $\delta = o(1/n)$; then for $\ell \sim \log_d(1/\delta)$, verifying $p^\ell_{e_s}[t] > \delta$ is equivalent to uncovering the entire $\log_d(1/\delta)$ neighborhood of $s$. Since a large random $d$-regular graph is (whp) an expander, this neighborhood has $\Omega(1/\delta)$ distinct nodes.

¹Bidirectional estimators have been developed before for reversible Markov chains [1]; our method, however, is not only more general, but conceptually and operationally simpler than these techniques (cf. Section 1.2).
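The Monte Carlo baseline just described is straightforward to sketch (our minimal illustration; the small chain and function name are ours, not the paper's):

```python
import random

def monte_carlo_mstp(P, sigma, t, ell, n_samples, rng):
    """Estimate p^ell_sigma[t] as the fraction of n_samples independent
    ell-step walks (started from sigma) that end at t; detecting
    probabilities of order delta needs on the order of 1/delta samples."""
    states = list(range(len(P)))
    hits = 0
    for _ in range(n_samples):
        v = rng.choices(states, weights=sigma)[0]      # V_0 ~ sigma
        for _ in range(ell):
            v = rng.choices(states, weights=P[v])[0]   # one transition
        hits += (v == t)
    return hits / n_samples

P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
est = monte_carlo_mstp(P, [1.0, 0.0, 0.0], t=2, ell=2,
                       n_samples=20000, rng=random.Random(0))
# the exact value is 0.25; the estimate concentrates around it
```

Note that each sample contributes a bare indicator, which is why roughly $1/\delta$ samples are needed before a probability of order $\delta$ is even observed once.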
Finally, note that as with Monte Carlo, power iterations can be adapted to either the source or the terminal state, but not both.
For reversible Markov chains, one can get a bidirectional algorithm for estimating $p^\ell_{e_s}[t]$ based on colliding random walks. For example, consider the problem of estimating length-$2\ell$ random walk transition probabilities in a regular undirected graph $G(V, E)$ on $n$ vertices [1, 6]. The main idea is that to test if a random walk goes from $s$ to $t$ in $2\ell$ steps with probability $\geq \delta$, we can generate two independent random walks of length $\ell$, starting from $s$ and $t$ respectively, and detect if they terminate at the same intermediate node. Suppose $p_w, q_w$ are the probabilities that a length-$\ell$ walk from $s$ and $t$ respectively terminates at node $w$; then from the reversibility of the chain, we have that $p^{2\ell}_{e_s}[t] = \sum_{w \in V} p_w q_w$; this is also the collision probability. The critical observation is that if we generate $\sqrt{1/\delta}$ walks from $s$ and $t$, then we get $1/\delta$ potential collisions, which is sufficient to detect if $p^{2\ell}_{e_s}[t] > \delta$. This argument forms the basis of the birthday paradox, and similar techniques are used in a variety of estimation problems (e.g., see [7]). Showing concentration for this estimator is tricky, as the samples are not independent; moreover, to control the variance of the samples, the algorithms often need to separately deal with 'heavy' intermediate nodes, where $p_w$ or $q_w$ is much larger than $O(1/n)$. Our proposed approach is much simpler, both in terms of algorithm and analysis, and, more significantly, it extends beyond reversible chains to any general discrete state-space Markov chain.
The most similar approach to ours is the recent FAST-PPR algorithm of Lofgren et al. [2] for PageRank estimation; our algorithm borrows several ideas and techniques from that work.
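As an aside, the collision idea described above can be sketched in a few lines (our illustration on a small regular undirected graph; all names are ours):

```python
import random

def collision_estimate(P, s, t, ell, m, rng):
    """Estimate p^{2*ell}_{e_s}[t] on a regular undirected graph: run m
    length-ell walks from s and m from t; the fraction of endpoint pairs
    that collide estimates sum_w p_w q_w, which equals p^{2*ell}_{e_s}[t]
    by reversibility."""
    states = list(range(len(P)))
    def endpoint(v):
        for _ in range(ell):
            v = rng.choices(states, weights=P[v])[0]
        return v
    ends_s = [endpoint(s) for _ in range(m)]
    ends_t = [endpoint(t) for _ in range(m)]
    return sum(a == b for a in ends_s for b in ends_t) / (m * m)

# Complete graph K4: 3-regular and undirected, so the walk is reversible.
P = [[0.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
est = collision_estimate(P, s=0, t=1, ell=2, m=500, rng=random.Random(1))
# exact value: p^4_{e_0}[1] = 1/4 - (1/4)(-1/3)^4 = 20/81, about 0.247
```

The pairwise collision counts here are exactly the dependent samples whose concentration analysis the text notes is delicate.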
However, the FAST-PPR algorithm relies heavily on the structure of PageRank; in particular, on the fact that the PageRank walk has Geometric($\alpha$) length (and hence can be stopped and restarted due to the memoryless property). Our work provides an elegant and powerful generalization of the FAST-PPR algorithm, extending the approach to general Markov chains.

2 The Bidirectional MSTP-estimation Algorithm

2.1 Algorithm

As described in Section 1.1, given a target state $t$, our bidirectional MSTP algorithm keeps track of a pair of vectors, the estimate vector $q^k_t \in \mathbb{R}^n$ and the residual vector $r^k_t \in \mathbb{R}^n$, for each length $k \in \{0, 1, 2, \ldots, \ell_{max}\}$. The vectors are initially all set to $0$ (i.e., the all-$0$ vector), except $r^0_t$, which is initialized as $e_t$. They are updated using a reverse push operation, defined as:

Algorithm 1 REVERSE-PUSH($v, i$)
Inputs: transition matrix $P$, estimate vector $q^i_t$, residual vectors $r^i_t$, $r^{i+1}_t$
1: return new estimate vector $\tilde{q}^i_t$ and residual vectors $\tilde{r}^i_t$, $\tilde{r}^{i+1}_t$ computed as:
$$\tilde{q}^i_t \leftarrow q^i_t + \langle r^i_t, e_v\rangle e_v; \qquad \tilde{r}^i_t \leftarrow r^i_t - \langle r^i_t, e_v\rangle e_v; \qquad \tilde{r}^{i+1}_t \leftarrow r^{i+1}_t + \langle r^i_t, e_v\rangle \left(e_v P^T\right)$$

²In particular, local power iterations are slow if a state has a very large out-neighborhood (for the forward iteration) or in-neighborhood (for the reverse update).

The main observation behind our algorithm is that we can re-write $p^\ell_\sigma[t]$ in terms of $\{q^k_t, r^k_t\}$ as an expectation over random sample-paths of the Markov chain as follows (cf. Equation (3)):
$$p^\ell_\sigma[t] = \langle \sigma, q^\ell_t\rangle + \sum_{k=0}^{\ell} \mathbb{E}_{V_k \sim \sigma P^k}\left[r^{\ell-k}_t(V_k)\right] \qquad (2)$$
In other words, given vectors $\{q^k_t, r^k_t\}$, we can get an unbiased estimator for $p^\ell_\sigma[t]$ by sampling a length-$\ell$ random trajectory $\{V_0, V_1, \ldots, V_\ell\}$ of the Markov chain $P$ starting at a random state $V_0$ sampled from the source distribution $\sigma$, and then adding the residuals along the trajectory as in Equation (2). We formalize this bidirectional MSTP algorithm in Algorithm 2.

Algorithm 2 Bidirectional-MSTP($P, \sigma, t, \ell_{max}, \delta$)
Inputs: transition matrix $P$, source distribution $\sigma$, target state $t$, maximum steps $\ell_{max}$, minimum probability threshold $\delta$, relative error bound $\epsilon$, failure probability $p_f$
1: Set accuracy parameter $c$ based on $\epsilon$ and $p_f$, and set reverse threshold $\delta_r$ (cf. Theorems 1 and 2); in our experiments we use $c = 7$ and $\delta_r = \sqrt{\delta/c}$
2: Initialize: estimate vectors $q^k_t = 0$ for all $k \in \{0, 1, 2, \ldots, \ell_{max}\}$; residual vectors $r^0_t = e_t$ and $r^k_t = 0$ for all $k \in \{1, 2, 3, \ldots, \ell_{max}\}$
3: for $i \in \{0, 1, \ldots, \ell_{max}\}$ do
4:   while $\exists\, v \in S$ s.t. $r^i_t[v] > \delta_r$ do
5:     Execute REVERSE-PUSH($v, i$)
6:   end while
7: end for
8: Set number of sample paths $n_f = c\,\ell_{max}\delta_r/\delta$ (see Theorem 1 for details)
9: for index $i \in \{1, 2, \ldots, n_f\}$ do
10:   Sample starting node $V^0_i \sim \sigma$
11:   Generate sample path $T_i = \{V^0_i, V^1_i, \ldots, V^{\ell_{max}}_i\}$ of length $\ell_{max}$ starting from $V^0_i$
12:   For $\ell \in \{1, 2, \ldots, \ell_{max}\}$: sample $k \sim \mathrm{Uniform}[0, \ell]$ and compute $S^\ell_{t,i} = \ell\, r^{\ell-k}_t[V^k_i]$ (we reinterpret the sum over $k$ in Equation (2) as an expectation, and sample $k$ rather than summing over $k \leq \ell$, for computational speed)
13: end for
14: return $\{\hat{p}^\ell_\sigma[t]\}_{\ell \in [\ell_{max}]}$, where $\hat{p}^\ell_\sigma[t] = \langle\sigma, q^\ell_t\rangle + (1/n_f)\sum_{i=1}^{n_f} S^\ell_{t,i}$

2.2 Some Intuition Behind our Approach

Before formally analyzing the performance of our MSTP-estimation algorithm, we first build some intuition as to why it works. In particular, it is useful to interpret the estimates and residues in probabilistic/combinatorial terms. In Figure 1, we consider a simple Markov chain on three states: Solid, Hollow and Checkered (henceforth $(S, H, C)$). On the right side, we illustrate an intermediate stage of reverse work using $S$ as the target, after performing the REVERSE-PUSH operations $(S, 0)$, $(H, 1)$, $(C, 1)$ and $(S, 2)$ in that order.

Figure 1: Visualizing a sequence of REVERSE-PUSH operations: given the Markov chain on the left with $S$ as the target, we perform REVERSE-PUSH operations $(S, 0)$, $(H, 1)$, $(C, 1)$, $(S, 2)$.

Each push at level $i$ uncovers a collection of length-$(i+1)$ paths terminating at $S$; for example, in the figure, we have uncovered all length-$2$ and length-$3$ paths, and several length-$4$ paths. The crucial observation is that each uncovered path of length $i$ starting from a node $v$ is accounted for in either $q^i_v$ or $r^i_v$. In particular, in Figure 1, all paths starting at solid nodes are stored in the estimates of the corresponding states, while those starting at blurred nodes are stored in the residue.
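Concretely, the REVERSE-PUSH bookkeeping and the forward sampling of Algorithm 2 can be sketched as follows. This is our simplified Python sketch, not the paper's implementation: it returns the estimate for $\ell = \ell_{max}$ only, and sums over $k$ along each walk instead of sampling $k$.

```python
import random
from collections import defaultdict

def reverse_push(P_in, estimates, residuals, v, i):
    """One REVERSE-PUSH(v, i): move the residual mass r^i_t[v] into the
    estimate q^i_t[v], and push it backwards to each in-neighbor u of v,
    weighted by P[u][v] (the e_v P^T term in Algorithm 1)."""
    mass = residuals[i].pop(v, 0.0)
    estimates[i][v] += mass
    for u, p_uv in P_in[v]:
        residuals[i + 1][u] += mass * p_uv

def bidirectional_mstp(P, sigma, t, ell_max, delta_r, n_f, rng):
    """Simplified Bidirectional-MSTP: reverse pushes until every residual
    at levels 0..ell_max is at most delta_r, then n_f forward walks whose
    accumulated residuals correct the estimate."""
    n = len(P)
    P_in = defaultdict(list)          # P_in[v] = [(u, P[u][v]), ...]
    for u in range(n):
        for v, p in enumerate(P[u]):
            if p > 0:
                P_in[v].append((u, p))
    estimates = [defaultdict(float) for _ in range(ell_max + 1)]
    residuals = [defaultdict(float) for _ in range(ell_max + 2)]
    residuals[0][t] = 1.0
    for i in range(ell_max + 1):      # reverse work
        heavy = [v for v, r in residuals[i].items() if r > delta_r]
        while heavy:
            for v in heavy:
                reverse_push(P_in, estimates, residuals, v, i)
            heavy = [v for v, r in residuals[i].items() if r > delta_r]
    states, total = list(range(n)), 0.0
    for _ in range(n_f):              # forward work
        v = rng.choices(states, weights=sigma)[0]
        walk = [v]
        for _ in range(ell_max):
            v = rng.choices(states, weights=P[v])[0]
            walk.append(v)
        total += sum(residuals[ell_max - k][walk[k]]
                     for k in range(ell_max + 1))
    return (sum(sigma[v] * estimates[ell_max][v] for v in range(n))
            + total / n_f)

P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
est = bidirectional_mstp(P, [1.0, 0.0, 0.0], t=2, ell_max=2,
                         delta_r=1e-9, n_f=1, rng=random.Random(0))
# with delta_r this small the reverse phase does all the work, so est
# lies within a few multiples of delta_r of the exact value 0.25
```

With a larger $\delta_r$ the reverse phase stops earlier and the forward walks pick up the leftover residual mass, trading reverse work for forward work exactly as in the analysis below.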
Now we can use this set of pre-discovered paths to boost the estimate returned by Monte Carlo trajectories generated starting from the source distribution. The dotted line in the figure represents the current reverse-work frontier; it separates the fully uncovered neighborhood of $(S, 0)$ from the remaining states $(v, i)$.
In a sense, what the REVERSE-PUSH operation does is construct a sequence of importance-sampling weights, which can then be used for Monte Carlo. An important novelty here is that the importance-sampling weights are: (i) adapted to the target state, and (ii) dynamically adjusted to ensure the Monte Carlo estimates have low variance. Viewed in this light, it is easy to see how the algorithm can be modified for applications beyond basic MSTP-estimation: for example, to non-homogeneous Markov chains, or for estimating the probability of hitting a target state $t$ for the first time in $\ell$ steps (cf. Section 3). Essentially, we only need an appropriate reverse-push/dynamic-programming update for the quantity of interest (with an associated invariant, as in Equation (2)).

2.3 Performance Analysis

We first formalize the critical invariant introduced in Equation (2):
Lemma 1. Given a terminal state $t$, suppose we initialize $r^0_t = e_t$, $q^0_t = 0$, and $q^k_t = r^k_t = 0$ for all $k \geq 1$. Then for any source distribution $\sigma$ and length $\ell$, after any arbitrary sequence of REVERSE-PUSH($v, k$) operations, the vectors $\{q^k_t, r^k_t\}$ satisfy the invariant:
$$p^\ell_\sigma[t] = \langle\sigma, q^\ell_t\rangle + \sum_{k=0}^{\ell}\langle\sigma P^k, r^{\ell-k}_t\rangle \qquad (3)$$
The proof follows the outline of a similar result in Andersen et al. [4] for PageRank estimation; due to lack of space, we defer it to our full version [8]. Using this result, we can now characterize the accuracy of the Bidirectional-MSTP algorithm:
Theorem 1.
We are given any Markov chain $P$, source distribution $\sigma$, terminal state $t$, maximum length $\ell_{max}$, and parameters $\delta$, $p_f$ and $\epsilon$ (i.e., the desired threshold, failure probability and relative error). Suppose we choose any reverse threshold $\delta_r > \delta$, and set the number of sample paths $n_f = c\delta_r/\delta$, where $c = \max\{6e/\epsilon^2, 1/\ln 2\}\ln(2\ell_{max}/p_f)$. Then for any length $\ell \leq \ell_{max}$, with probability at least $1 - p_f$, the estimate returned by Bidirectional-MSTP satisfies:
$$\big|\hat{p}^\ell_\sigma[t] - p^\ell_\sigma[t]\big| < \max\big\{\epsilon\, p^\ell_\sigma[t],\ \delta\big\}.$$

Proof. Given any Markov chain $P$ and terminal state $t$, note first that for a given length $\ell \leq \ell_{max}$, Equation (2) shows that the estimate $\hat{p}^\ell_\sigma[t]$ is an unbiased estimator. Now, for any random trajectory $T_k$, the score $S^\ell_{t,k}$ obeys: (i) $\mathbb{E}[S^\ell_{t,k}] \leq p^\ell_\sigma[t]$, and (ii) $S^\ell_{t,k} \in [0, \ell\delta_r]$; the first inequality again follows from Equation (2), while the second follows from the fact that we executed REVERSE-PUSH operations until all residual values were less than $\delta_r$.
Now consider the rescaled random variable $X_k = S^\ell_{t,k}/(\ell\delta_r)$ and $X = \sum_{k \in [n_f]} X_k$; then we have $X_k \in [0, 1]$, $\mathbb{E}[X] \leq (n_f/\ell\delta_r)\, p^\ell_\sigma[t]$, and also $X - \mathbb{E}[X] = (n_f/\ell\delta_r)(\hat{p}^\ell_\sigma[t] - p^\ell_\sigma[t])$. Moreover, using standard Chernoff bounds (cf. Theorem 1.1 in [9]), we have:
$$\mathbb{P}\big[|X - \mathbb{E}[X]| > \epsilon\mathbb{E}[X]\big] < 2\exp\Big(-\frac{\epsilon^2\mathbb{E}[X]}{3}\Big) \quad \text{and} \quad \mathbb{P}[X > b] \leq 2^{-b} \ \text{for any } b > 2e\mathbb{E}[X].$$

Now we consider two cases:
1. $\mathbb{E}[S^\ell_{t,k}] > \delta/2e$ (i.e., $\mathbb{E}[X] > n_f\delta/2e\ell\delta_r = c/2e$): Here, we can use the first concentration bound to get:
$$\mathbb{P}\big[\big|\hat{p}^\ell_\sigma[t] - p^\ell_\sigma[t]\big| \geq \epsilon p^\ell_\sigma[t]\big] = \mathbb{P}\Big[|X - \mathbb{E}[X]| \geq \frac{\epsilon n_f}{\ell\delta_r}\, p^\ell_\sigma[t]\Big] \leq \mathbb{P}\big[|X - \mathbb{E}[X]| \geq \epsilon\mathbb{E}[X]\big] \leq 2\exp\Big(-\frac{\epsilon^2\mathbb{E}[X]}{3}\Big) \leq 2\exp\Big(-\frac{\epsilon^2 c}{6e}\Big),$$
where we use that $n_f = c\ell_{max}\delta_r/\delta$ (cf. Algorithm 2). Moreover, by the union bound, we have:
$$\mathbb{P}\Big[\bigcup_{\ell \leq \ell_{max}}\big\{\big|\hat{p}^\ell_\sigma[t] - p^\ell_\sigma[t]\big| \geq \epsilon p^\ell_\sigma[t]\big\}\Big] \leq 2\ell_{max}\exp\Big(-\frac{\epsilon^2 c}{6e}\Big).$$
Now as long as $c \geq (6e/\epsilon^2)\ln(2\ell_{max}/p_f)$, we get the desired failure probability.
2. $\mathbb{E}[S^\ell_{t,k}] < \delta/2e$ (i.e., $\mathbb{E}[X] < c/2e$): In this case, note first that since $X \geq 0$, we have $p^\ell_\sigma[t] - \hat{p}^\ell_\sigma[t] \leq p^\ell_\sigma[t] - \langle\sigma, q^\ell_t\rangle = (\ell\delta_r/n_f)\,\mathbb{E}[X] \leq \delta/2e < \delta$. On the other hand, we also have:
$$\mathbb{P}\big[\hat{p}^\ell_\sigma[t] - p^\ell_\sigma[t] \geq \delta\big] = \mathbb{P}\Big[X - \mathbb{E}[X] \geq \frac{n_f\delta}{\ell\delta_r}\Big] \leq \mathbb{P}[X \geq c] \leq 2^{-c},$$
where the last inequality follows from our second concentration bound, which holds since we have $c > 2e\mathbb{E}[X]$. Now as before, we can use the union bound to show that the failure probability is bounded by $p_f$ as long as $c \geq \log_2(\ell_{max}/p_f)$.
Combining the two cases, we see that as long as $c \geq \max\{6e/\epsilon^2, 1/\ln 2\}\ln(2\ell_{max}/p_f)$, we have $\mathbb{P}\big[\bigcup_{\ell \leq \ell_{max}}\big\{\big|\hat{p}^\ell_\sigma[t] - p^\ell_\sigma[t]\big| \geq \max\{\delta, \epsilon p^\ell_\sigma[t]\}\big\}\big] \leq p_f$.

One aspect that is not obvious from the intuition in Section 2.2 or the accuracy analysis is whether using a bidirectional method actually improves the running time of MSTP-estimation. This is addressed by the following result, which shows that for typical targets, our algorithm achieves significant speedup:
Theorem 2. Let any Markov chain $P$, source distribution $\sigma$, maximum length $\ell_{max}$ and parameters $\delta$, $p_f$ and $\epsilon$ be given. Suppose we set $\delta_r = \sqrt{\frac{\epsilon^2\delta}{\ell_{max}\log(\ell_{max}/p_f)}}$. Then for a uniform random choice of $t \in S$, the Bidirectional-MSTP algorithm has a running time of $\tilde{O}\big(\ell_{max}^{3/2}\sqrt{d/\delta}\big)$.

Proof. The runtime of Algorithm 2 consists of two parts:
Forward-work (i.e., for generating trajectories): we generate $n_f = c\ell_{max}\delta_r/\delta$ sample trajectories, each of length $\ell_{max}$; hence the running time is $O(c\delta_r\ell_{max}^2/\delta)$ for any Markov chain $P$, source distribution $\sigma$ and target node $t$. Substituting for $c$ from Theorem 1, we get that the forward-work running time is $T_f = O\Big(\frac{\ell_{max}^2\delta_r\log(\ell_{max}/p_f)}{\epsilon^2\delta}\Big)$.
Reverse-work (i.e., for REVERSE-PUSH operations): Let $T_r$ denote the reverse-work runtime for a uniform random choice of $t \in S$. Then we have:
$$\mathbb{E}[T_r] = \frac{1}{|S|}\sum_{t \in S}\sum_{k=0}^{\ell_{max}}\sum_{v \in S}(d_{in}(v) + 1)\,\mathbb{1}_{\{\text{REVERSE-PUSH}(v,k)\text{ is executed}\}}$$
Now for a given $t \in S$ and $k \in \{0, 1, \ldots, \ell_{max}\}$, note that the only states $v \in S$ on which we execute REVERSE-PUSH($v, k$) are those with residual $r^k_t(v) > \delta_r$; consequently, for these states, we have $q^k_t(v) > \delta_r$, and hence, by Equation (3), $p^k_{e_v}[t] \geq \delta_r$ (by setting $\sigma = e_v$, i.e., starting from state $v$). Moreover, a REVERSE-PUSH($v, k$) operation involves updating the residuals for $d_{in}(v) + 1$ states. Note that $\sum_{t \in S} p^k_{e_v}[t] = 1$, and hence, via a straightforward counting argument, we have that for any $v \in S$, $\sum_{t \in S}\mathbb{1}_{\{p^k_{e_v}[t] \geq \delta_r\}} \leq 1/\delta_r$. Thus, we have:
$$\mathbb{E}[T_r] \leq \frac{1}{|S|}\sum_{v \in S}\sum_{k=0}^{\ell_{max}}(d_{in}(v) + 1)\sum_{t \in S}\mathbb{1}_{\{p^k_{e_v}[t] \geq \delta_r\}} \leq \frac{1}{|S|}\sum_{v \in S}(\ell_{max} + 1)(d_{in}(v) + 1)\frac{1}{\delta_r} = O\Big(\frac{\ell_{max}}{\delta_r}\cdot\frac{\sum_{v \in S} d_{in}(v)}{|S|}\Big) = O\Big(\frac{\ell_{max}\, d}{\delta_r}\Big)$$
Finally, we choose $\delta_r = \sqrt{\frac{\epsilon^2\delta}{\ell_{max}\log(\ell_{max}/p_f)}}$ to balance $T_f$ and $T_r$ and get the result.

3 Applications of MSTP-estimation

• Estimating the Stationary Distribution and Hitting Probabilities: MSTP-estimation can be used in two ways to estimate stationary probabilities $\pi[t]$. First, if we know the mixing time $\tau_{mix}$ of the chain $P$, we can directly use Algorithm 2 to approximate $\pi[t]$ by setting $\ell_{max} = \tau_{mix}$ and using any source distribution $\sigma$. Theorem 2 then guarantees that we can estimate a stationary probability of order $\delta$ in time $O(\tau_{mix}^{3/2}\sqrt{d/\delta})$. In comparison, Monte Carlo has $O(\tau_{mix}/\delta)$ runtime. We note that in practice we usually do not know the mixing time; in such a setting, our algorithm can be used to compute an estimate of $p^\ell_\sigma[t]$ for all values of $\ell \leq \ell_{max}$.
An alternative is to modify Algorithm 2 to estimate the truncated hitting probability $\hat{p}^{\ell,hit}_\sigma[t]$ (i.e., the probability of hitting $t$ for the first time, starting from $\sigma$, in exactly $\ell$ steps). By setting $\sigma = e_t$, we get an estimate for the expected truncated return time $\mathbb{E}[T_t\mathbb{1}_{\{T_t \leq \ell_{max}\}}] = \sum_{\ell \leq \ell_{max}}\ell\,\hat{p}^{\ell,hit}_{e_t}[t]$. Now, using the fact that $\pi[t] = 1/\mathbb{E}[T_t]$, we can get a lower bound for $\pi[t]$ which converges to $\pi[t]$ as $\ell_{max} \to \infty$. We note also that the truncated hitting time has been shown to be useful in other applications such as identifying similar documents on a document-word-author graph [10].
To estimate the truncated hitting probability, we modify Algorithm 2 as follows: at each stage $i \in \{1, 2, \ldots, \ell_{max}\}$ (note: not $i = 0$), instead of REVERSE-PUSH($t, i$), we update $\tilde{q}^i_t[t] = q^i_t[t] + r^i_t[t]$, set $\tilde{r}^i_t[t] = 0$, and do not push back $r^i_t[t]$ to the in-neighbors of $t$ in the $(i+1)$-th stage. The remaining algorithm stays the same. It is easy to see from the discussion in Section 2.2 that the resulting quantity $\hat{p}^{\ell,hit}_\sigma[t]$ is an unbiased estimate of $\mathbb{P}[\text{Hitting time of } t = \ell \mid X_0 \sim \sigma]$; we omit a formal proof due to lack of space.
• Exact Stationary Probabilities in Strong Doeblin Chains: A strong Doeblin chain [11] is obtained by mixing a Markov chain $P$ and a distribution $\sigma$ as follows: at each transition, the process proceeds according to $P$ with probability $\alpha$, else samples a state from $\sigma$. Doeblin chains are widely used in ML applications; special cases include the celebrated PageRank metric [12], variants such as HITS and SALSA [13], and other algorithms for applications such as ranking [14] and structured prediction [15]. An important property of these chains is that if we sample a starting node $V_0$ from $\sigma$ and sample a trajectory of length Geometric($\alpha$) starting from $V_0$, then the terminal node is an unbiased sample from the stationary distribution [16]. There are two ways in which our algorithm can be used for this purpose: one is to replace the REVERSE-PUSH algorithm with a corresponding local update algorithm for the strong Doeblin chain (similar to the one in Andersen et al. [4] for PageRank), and then sample random trajectories of length Geometric($\alpha$). A more direct technique is to choose some $\ell_{max} \gg 1/\alpha$, estimate $\{p^\ell_\sigma[t]\}$ for all $\ell \in [\ell_{max}]$, and then directly compute the stationary distribution as $p[t] = \sum_{\ell=1}^{\ell_{max}}\alpha^{\ell-1}(1-\alpha)\,p^\ell_\sigma[t]$.
• Graph Diffusions: If we assign a weight $\alpha_i$ to random walks of length $i$ on a (weighted) graph, the resulting scoring functions $f(P, \sigma)[t] := \sum_{i=0}^{\infty}\alpha_i\,(\sigma^T P^i)[t]$ are known as a graph diffusion [17] and are used in a variety of applications. The case where $\alpha_i = \alpha^{i-1}(1-\alpha)$ corresponds to PageRank. If instead the length is drawn according to a Poisson distribution (i.e., $\alpha_i = e^{-\alpha}\alpha^i/i!$), then the resulting function is called the heat kernel $h(G, \alpha)$; this too has several applications, including finding communities (clusters) in large networks [5]. Note that for any function $f$ as defined above, the truncated sum $f^{\ell_{max}} = \sum_{i=0}^{\ell_{max}}\alpha_i\,(\sigma^T P^i)$ obeys $\|f - f^{\ell_{max}}\|_\infty \leq \sum_{i=\ell_{max}+1}^{\infty}\alpha_i$. Thus a guarantee on an estimate for the truncated sum directly translates to a guarantee on the estimate for the diffusion. We can use MSTP-estimation to efficiently estimate these truncated sums.
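The truncated-sum bound above is easy to check numerically. A small sketch (our illustration, using exact vector-matrix products in place of MSTP estimates) computing a truncated heat kernel and its tail error:

```python
import math
import numpy as np

def heat_kernel_truncated(P, sigma, alpha, ell_max):
    """Truncated heat kernel f^{ell_max} = sum_{i<=ell_max} a_i (sigma P^i)
    with Poisson weights a_i = e^{-alpha} alpha^i / i!. Returns the
    truncated diffusion vector and the tail mass sum_{i>ell_max} a_i,
    which bounds the sup-norm truncation error."""
    x = np.asarray(sigma, dtype=float)
    f = np.zeros_like(x)
    kept = 0.0
    for i in range(ell_max + 1):
        a_i = math.exp(-alpha) * alpha ** i / math.factorial(i)
        f += a_i * x
        kept += a_i
        x = x @ P
    return f, 1.0 - kept

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
f, tail = heat_kernel_truncated(P, np.array([1.0, 0.0, 0.0]),
                                alpha=5.0, ell_max=27)
# for alpha = 5 and ell_max = 27 the Poisson tail is about 1e-12
```

In the bidirectional setting, each exact $\sigma P^i$ term in this sketch would be replaced by the MSTP estimate $\hat{p}^i_\sigma[\cdot]$, and the tail bound carries over unchanged.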
We perform numerical experiments on heat kernel estimation in the next section.

• Conductance Testing in Graphs: MSTP-estimation is an essential primitive for conductance testing in large Markov chains [1]. In particular, in regular undirected graphs, Kale et al. [6] develop a sublinear bidirectional estimator based on counting collisions between walks in order to identify 'weak' nodes – those which belong to sets with small conductance. Our algorithm can be used to extend this process to any graph, including weighted and directed graphs.

• Local Algorithms: There has been much recent interest in local algorithms – those which perform computations given only a small neighborhood of a source node [18]. In this regard, we note that Bidirectional-MSTP gives a natural local algorithm for MSTP-estimation, and thus for the applications mentioned above – given a k-hop neighborhood around the source and target, we can perform Bidirectional-MSTP with ℓmax set to k. The proof of this follows from the fact that the invariant in Equation (2) holds after any sequence of REVERSE-PUSH operations.

Figure 2: Estimating heat kernels: Bidirectional MSTP-estimation vs. Monte Carlo, Forward Push. To compare runtimes, we choose parameters such that the mean relative error of all algorithms is around 10%. Notice that Bidirectional-MSTP is 100 times faster than the other algorithms.

4 Experiments

To demonstrate the efficiency of our algorithm on large Markov chains, we use heat kernel estimation (cf. Section 3) as an example application. The heat kernel corresponds to a non-homogeneous random walk: it is the probability of stopping at the target on a random walk from the source, where the walk length is sampled from a Poisson(ℓ) distribution.
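For intuition, the Monte Carlo baseline in our comparison can be sketched in a few lines: sample a Poisson(ℓ) walk length, walk that many steps, and count terminal hits. The sketch below is a toy illustration (the names and the adjacency-list representation are ours; it assumes an unweighted chain in which every state has at least one out-neighbor):

```python
import math
import random

def sample_poisson(mean, rng):
    # Knuth's multiplicative Poisson sampler; fine for small means like ell = 5.
    threshold, k, prod = math.exp(-mean), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def heat_kernel_mc(out_neighbors, source, target, ell, num_walks, seed=0):
    """Estimate the probability that a walk from `source`, with length drawn
    from Poisson(ell), ends at `target`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_walks):
        state = source
        for _ in range(sample_poisson(ell, rng)):
            state = rng.choice(out_neighbors[state])
        hits += (state == target)
    return hits / num_walks
```

On the two-state chain {0: [1], 1: [0]}, a walk from 0 returns to 0 exactly when its length is even, so for ℓ = 5 the estimate should concentrate around (1 + e^{−2ℓ})/2 ≈ 0.5.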
In real-world graphs, a heat-kernel value between a pair of nodes has been shown to be a good indicator of an underlying community relationship [5] – this suggests that it can serve as a metric for personalized search on social networks. For example, if a social network user s wants to view a list of users attending some event, then sorting these users by heat-kernel values will result in the users most similar to s appearing on top. Bidirectional-MSTP is ideal for such personalized search applications, as the set of users filtered by a search query is typically much smaller than the set of nodes on the network.

In Figure 2, we compare the runtime of different algorithms for heat kernel computation on four real-world graphs, ranging from millions to billions of edges³. For each graph, for random (source, target) pairs, we compute the heat kernel using Bidirectional-MSTP, as well as two benchmark algorithms – Monte Carlo, and the Forward Push algorithm (as presented in [5]). All three algorithms have parameters which allow them to trade off speed and accuracy – for a fair comparison, we choose parameters such that the empirical mean relative error of each algorithm is 10%. All three algorithms were implemented in Scala – for the Forward Push algorithm, our implementation follows the code linked from [5].

We set the average walk length ℓ = 5 (since longer walks will mix into the stationary distribution), and set the maximum length to ℓ + 10√ℓ ≈ 27; the probability of a walk being longer than this is 10⁻¹², which is negligible. For reproducibility, our source code is available on our website (cf. [8]).

Figure 2 shows that across all graphs, Bidirectional-MSTP is 100x faster than the two benchmark algorithms. For example, on the Twitter graph, it can estimate a heat kernel score in 0.1 seconds, while the other algorithms take more than 4 minutes.
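The 10⁻¹² tail figure for the walk-length cutoff can be checked directly by summing the Poisson upper tail in log-space; a quick throwaway script (not part of our released code):

```python
import math

def poisson_tail(mean, k_max, extra_terms=200):
    """P(X > k_max) for X ~ Poisson(mean), summing pmf terms computed
    in log-space to avoid overflow in the factorial."""
    return sum(
        math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
        for k in range(k_max + 1, k_max + 1 + extra_terms)
    )

ell = 5
ell_max = round(ell + 10 * math.sqrt(ell))  # = 27
tail = poisson_tail(ell, ell_max)           # on the order of 1e-12
```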
We note though that Monte Carlo and Forward Push can return scores from the source to all targets, rather than just one target – thus Bidirectional-MSTP is most useful when we want the score for a small set of targets.

Acknowledgments

Research supported by the DARPA GRAPHS program via grant FA9550-12-1-0411, and by NSF grant 1447697. Peter Lofgren was supported by an NPSC fellowship. Thanks to Ashish Goel and other members of the Social Algorithms Lab at Stanford for many helpful discussions.

³The Pokec [19], LiveJournal [20], and Orkut [20] datasets are from SNAP [21]; Twitter-2010 [22] was downloaded from the Laboratory for Web Algorithmics [23]. Refer to our full version [8] for details.

References

[1] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation. Springer, 2011.

[2] Peter Lofgren, Siddhartha Banerjee, Ashish Goel, and C. Seshadhri. FAST-PPR: Scaling personalized PageRank estimation for large graphs. In ACM SIGKDD'14, 2014.

[3] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using PageRank vectors. In IEEE FOCS'06, 2006.

[4] Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Vahab S. Mirrokni, and Shang-Hua Teng. Local computation of PageRank contributions. In Algorithms and Models for the Web-Graph. Springer, 2007.

[5] Kyle Kloster and David F. Gleich. Heat kernel based community detection. In ACM SIGKDD'14, 2014.

[6] Satyen Kale, Yuval Peres, and C. Seshadhri. Noise tolerance of expanders and sublinear expander reconstruction. In IEEE FOCS'08, 2008.

[7] Rajeev Motwani, Rina Panigrahy, and Ying Xu. Estimating sum by weighted sampling. In Automata, Languages and Programming, pages 53–64. Springer, 2007.

[8] Siddhartha Banerjee and Peter Lofgren. Fast bidirectional probability estimation in Markov models. Technical report, 2015. http://arxiv.org/abs/1507.05998.

[9] Devdatt P. Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.

[10] Purnamrita Sarkar, Andrew W. Moore, and Amit Prakash. Fast incremental proximity search in large graphs. In Proceedings of the 25th International Conference on Machine Learning, pages 896–903. ACM, 2008.

[11] Wolfgang Doeblin. Éléments d'une théorie générale des chaînes simples constantes de Markoff. In Annales Scientifiques de l'École Normale Supérieure, volume 57, pages 61–111. Société Mathématique de France, 1940.

[12] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. 1999.

[13] Ronny Lempel and Shlomo Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1):387–401, 2000.

[14] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Iterative ranking from pair-wise comparisons. In Advances in Neural Information Processing Systems, pages 2474–2482, 2012.

[15] Jacob Steinhardt and Percy Liang. Learning fast-mixing models for structured prediction. In ICML'15, 2015.

[16] Krishna B. Athreya and Örjan Stenflo. Perfect sampling for Doeblin chains. Sankhyā: The Indian Journal of Statistics, pages 763–777, 2003.

[17] Fan Chung. The heat kernel as the PageRank of a graph. Proceedings of the National Academy of Sciences, 104(50):19735–19740, 2007.

[18] Christina E. Lee, Asuman Ozdaglar, and Devavrat Shah. Computing the stationary distribution locally. In Advances in Neural Information Processing Systems, pages 1376–1384, 2013.

[19] Lubos Takac and Michal Zabovsky. Data analysis in public social networks. In International Scientific Conference & Workshop Present Day Trends of Innovations, 2012.

[20] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 5th ACM/USENIX Internet Measurement Conference (IMC'07), San Diego, CA, October 2007.

[21] Stanford Network Analysis Platform (SNAP). http://snap.stanford.edu/. Accessed: 2014-02-11.

[22] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In ACM WWW'11, 2011.

[23] Laboratory for Web Algorithmics. http://law.di.unimi.it/datasets.php. Accessed: 2014-02-11.