{"title": "Deanonymization in the Bitcoin P2P Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1364, "page_last": 1373, "abstract": "Recent attacks on Bitcoin's peer-to-peer (P2P) network demonstrated that its transaction-flooding protocols, which are used to ensure network consistency, may enable user deanonymization---the linkage of a user's IP address with her pseudonym in the Bitcoin network. In 2015, the Bitcoin community responded to these attacks by changing the network's flooding mechanism to a different protocol, known as diffusion. However, it is unclear if diffusion actually improves the system's anonymity. In this paper, we model the Bitcoin networking stack and analyze its anonymity properties, both pre- and post-2015. The core problem is one of epidemic source inference over graphs, where the observational model and spreading mechanisms are informed by Bitcoin's implementation; notably, these models have not been studied in the epidemic source detection literature before. We identify and analyze near-optimal source estimators. This analysis suggests that Bitcoin's networking protocols (both pre- and post-2015) offer poor anonymity properties on networks with a regular-tree topology. We confirm this claim in simulation on a 2015 snapshot of the real Bitcoin P2P network topology.", "full_text": "Deanonymization in the Bitcoin P2P Network\n\nGiulia Fanti and Pramod Viswanath\n\nAbstract\n\nRecent attacks on Bitcoin\u2019s peer-to-peer (P2P) network demonstrated that its\ntransaction-\ufb02ooding protocols, which are used to ensure network consistency,\nmay enable user deanonymization\u2014the linkage of a user\u2019s IP address with her\npseudonym in the Bitcoin network. In 2015, the Bitcoin community responded\nto these attacks by changing the network\u2019s \ufb02ooding mechanism to a different\nprotocol, known as diffusion. However, it is unclear if diffusion actually improves\nthe system\u2019s anonymity. 
In this paper, we model the Bitcoin networking stack and\nanalyze its anonymity properties, both pre- and post-2015. The core problem is\none of epidemic source inference over graphs, where the observational model and\nspreading mechanisms are informed by Bitcoin\u2019s implementation; notably, these\nmodels have not been studied in the epidemic source detection literature before.\nWe identify and analyze near-optimal source estimators. This analysis suggests\nthat Bitcoin\u2019s networking protocols (both pre- and post-2015) offer poor anonymity\nproperties on networks with a regular-tree topology. We confirm this claim in\nsimulation on a 2015 snapshot of the real Bitcoin P2P network topology.\n\n1 Introduction\n\nThe Bitcoin cryptocurrency has seen widespread adoption, due in part to its reputation as a privacy-preserving\nfinancial system [17, 22]. In practice, though, Bitcoin exhibits serious privacy vulnerabilities\n[3, 19, 27, 28, 24]. Most of these vulnerabilities arise because of two key properties: (1)\nBitcoin associates each user with a pseudonym, and (2) pseudonyms can be linked to financial transactions\nthrough a public transaction ledger, called the blockchain [23]. If an attacker can associate a\npseudonym with a human identity, the attacker may learn the user\u2019s transaction history.\nIn practice, there are several ways to link a user to her Bitcoin pseudonym. The most commonly-studied\nmethods analyze transaction patterns in the public blockchain, and link those patterns using\nside information [3, 19, 27, 28, 24]. In this paper, we are interested in a lower-layer vulnerability: the\nnetworking stack. Like most cryptocurrencies, Bitcoin nodes communicate over a P2P network [23].\nWhenever a user (Alice) generates a transaction (i.e., sends bitcoins to another user, Bob), she first\ncreates a \u201ctransaction message\u201d that contains her pseudonym, Bob\u2019s pseudonym, and the transaction\namount. 
Alice subsequently \ufb02oods this transaction message over the P2P network, which enables\nother users to validate her transaction and incorporate it into the global blockchain.\nThe anonymity implications of transaction broadcasting were largely ignored until recently, when\nresearchers demonstrated practical deanonymization attacks on the P2P network [6, 15]. These\nattacks use a \u201csupernode\u201d to connect to all active Bitcoin nodes and listen to the transaction traf\ufb01c\nthey relay [15, 6, 7]. By using simple estimators to infer the source IP of each transaction broadcast,\nthis eavesdropper adversary was able to link IP addresses to Bitcoin pseudonyms with an accuracy\nof up to 30% [6]. We refer to such linkage as deanonymization.\n\nGiulia Fanti (gfanti@andrew.cmu.edu) is in the ECE Department at Carnegie Mellon University. Pramod\nViswanath (pramodv@illinois.edu) is in the ECE Department at the University of Illinois at Urbana-\nChampaign. This work was funded by NSF grant CCF-1705007.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fIn 2015, the Bitcoin community responded to these attacks by changing its \ufb02ooding protocols from\na gossip-style protocol known as trickle spreading to a diffusion spreading protocol that spreads\ncontent with independent exponential delays [1]. We de\ufb01ne these protocols precisely in Section 2.\nHowever, no systematic motivation was provided for this shift. Indeed, it is unclear whether the\nchange actually defends against the deanonymization attacks in [6, 15].\nProblem and contributions. The main point of our paper is to show that Bitcoin\u2019s \ufb02ooding protocols\nhave poor anonymity properties, and the community\u2019s shift from trickle spreading (pre-2015) to\ndiffusion spreading (post-2015) did not help the situation. 
The problem of deanonymizing a user\nin this context is mathematically equivalent to inferring the source of a random spreading process\nover a graph, given partial observations of the spread. The optimal (maximum-likelihood) source-\nidenti\ufb01cation algorithms change between spreading protocols; identifying such algorithms and\nquantifying their accuracy is the primary focus of this work. We \ufb01nd that despite having different\nmaximum-likelihood estimators, trickle and diffusion exhibit roughly the same, poor anonymity\nproperties. Our speci\ufb01c contributions are threefold:\n(1) Modeling. We model the Bitcoin P2P network and an eavesdropper adversary, whose capabilities\nre\ufb02ect recent practical attacks in [6, 15]. Most Bitcoin network protocols are not explicitly docu-\nmented, so modeling the system requires parsing a combination of documentation, papers, and code.\nSeveral of the resulting models are new to the epidemic source detection literature.\n(2) Analysis of Trickle (Pre-2015). We analyze the probability of deanonymization by an eavesdropper\nadversary under trickle propagation. Our analysis is conducted over a regular tree-structured network.\nAlthough the Bitcoin network topology is not a regular tree, we show in Section 2 that regular trees are\na reasonable \ufb01rst-order model. We consider graph-independent estimators (e.g., the \ufb01rst-timestamp\nestimator), as well as maximum-likelihood estimators; both are de\ufb01ned precisely in Section 2. Our\nanalysis suggests that although the \ufb01rst-timestamp estimator performs poorly on high-degree trees,\nmaximum-likelihood estimators achieve high probabilities of detection for trees of any degree d.\n(3) Analysis of Diffusion (Post-2015). We conduct a similar analysis of diffusion spreading, which was\nadopted in 2015 as a \ufb01x for the anonymity weaknesses observed under trickle propagation [6, 15]. 
The\nanalysis of diffusion requires different theoretical tools, including nonlinear differential equations and\ngeneralized P\u00f3lya urns. Although the analysis techniques and attack mechanisms are different, we find\nthat the anonymity properties of diffusion are similar to those of trickle. Namely, the first-timestamp\nestimator\u2019s probability of detection decays to 0 as degree d grows, but the maximum-likelihood\nprobability of detection remains high (in particular, non-vanishing) even as $d \\to \\infty$.\n\n2 Model and related work\n\nNetwork model. We model the P2P network of Bitcoin nodes as a graph G(V, E), where V is\nthe set of all server nodes and E is the set of edges, or connections, between them. Each server is\nrepresented by an (IP address, port) tuple; it can establish up to eight outgoing connections to other\nBitcoin nodes [6, 2]. The resulting sparse random graph between nodes can be modeled approximately\nas a 16-regular graph; in practice, the average degree is closer to 8 due to nonhomogeneities across\nnodes [20]. The graph is locally tree-like and (approximately) regular. For this reason, regular trees\nare a natural class of graphs to study. In our theoretical analysis, we model G as a d-regular tree. We\nvalidate this choice by running simulations on a snapshot of the true Bitcoin network [20] (Section 5).\nSpreading protocols. Each transaction must be broadcast over the network; we analyze the spread\nof a single message originating from source node $v^* \\in V$. Without loss of generality, we label $v^*$ as\nnode \u20180\u2019 when iterating over nodes. At time t = 0, the message starts spreading according to one of\ntwo randomized protocols: trickle (pre-2015) or diffusion (post-2015).\nTrickle spreading is a gossip-based flooding protocol. Each source or relay chooses a neighboring\npeer (called the \u2018trickle\u2019 node) uniformly at random, every 200 ms. 
If the trickle node has not\nyet received the message, the sender forwards the message [6] (see footnote 1). We model this by considering a\ncanonical, simpler spreading protocol of round-robin gossip. In round-robin gossip, each source or\nrelay randomly orders its neighbors who have not yet seen the message; we call these uninfected\nneighbors. In each successive (discrete) timestep, the node transmits the message to the next neighbor\nin its ordering. Thus, if a node has d neighbors, all d neighbors will receive the message within d\ntimesteps. This differs from trickle spreading, where the time-to-infection is a coupon collector\u2019s\nproblem, and therefore takes $\\Theta(d \\log d)$ timesteps in expectation [8]. We will henceforth abuse\nterminology by referring to round-robin gossip as trickle spreading.\n\nFootnote 1: This description omits some details of trickle spreading, which we do not consider in our analysis. For\nexample, with probability 1/4, each relay forwards the message instantaneously to its neighbors without trickling.\n\nIn diffusion, each source or relay node transmits the message to each of its uninfected neighbors with\nan independent, exponential delay of rate $\\lambda$. In practice, Bitcoin uses a higher rate on outgoing edges\nthan incoming ones [2]; we omit this distinction in our model. We assume a continuous-time system,\nwith each node starting the exponential clocks upon receipt (or creation) of a message.\nFor both protocols, we let $X_v$ denote the timestamp at which node $v \\in V$ receives a given message.\nNote that server nodes cannot be infected more than once. We assume the message originates at time\nt = 0, so $X_{v^*} = X_0 = 0$. Moreover, we let $G_t(V_t, E_t)$ denote the infected subgraph of G at time\nt, or the subgraph of nodes who have received the message (but not necessarily reported it to the\nadversary) by time t.\nAdversarial model. 
The adversary\u2019s goal is to link a message with the source (IP address, port)\u2014i.e.,\nto identify the source node $v^* \\in V$. We consider an eavesdropper adversary, whose capabilities\nare modeled on the practical deanonymization attacks in [6, 15]. These attacks use a supernode that\nconnects to most of the servers in the Bitcoin network. It can make multiple connections to each\nhonest server, with each connection coming from a different (IP address, port). Hence, the honest\nserver does not realize that the supernode\u2019s connections are all from the same entity. We model this\nby assuming that the eavesdropper adversary makes a fixed number $\\theta$ of connections to each server,\nwhere $\\theta \\ge 1$. We do not include these adversarial connections in the original server graph G, so G\nremains a d-regular graph (see Figure 1). The supernode can learn the network structure between\nservers [6], so we assume that G(V, E) is known to the eavesdropper.\nThe supernode in [6, 15] observes the timestamps at which\nmessages are relayed from each honest server, without\nrelaying or transmitting content. If the adversary maintains\nmultiple active connections to each server ($\\theta > 1$), it\nreceives the message $\\theta$ times from each server. We let\n$\\tau_v$ denote the time at which the adversary first observes\nthe message from node $v \\in V$. We let $\\tau = (\\tau_v)_{v \\in V}$\ndenote the set of all observed first-timestamps. We assume\ntimestamps are relative to time t = 0, i.e., the adversary\nknows when the message started spreading.\nSource estimation. The adversary\u2019s goal is as follows:\ngiven the observed timestamps $\\tau$ (up to estimation time\nt) and the graph G, find an estimator $M(\\tau, G)$ that outputs\nthe true source. 
Our metric of success for the adversary is probability of detection, $P(M(\\tau, G) = v^*)$,\ntaken over the random spreading realization (captured by $\\tau$) and any randomness in the estimator.\nIn [6, 15], the adversary uses a variant of the first-timestamp estimator $M_{FT}(\\tau, G) = \\arg\\min_{v \\in V_t} \\tau_v$,\nwhich outputs the first node (prior to estimation time t) to report the message to the adversary. The\nfirst-timestamp estimator requires no knowledge of the graph, and it is computationally easy to\nimplement. We begin by analyzing this estimator for both trickle and diffusion propagation.\nWe also consider the maximum-likelihood (ML) estimator: $M_{ML}(\\tau, G) = \\arg\\max_{v \\in V} P(\\tau \\mid G, v^* = v)$.\nThe ML estimator depends on the time of estimation t to the extent that $\\tau$ only contains timestamps\nup to time t. Unlike the first-timestamp estimator, the ML estimator differs across spreading protocols,\ndepends on the graph, and may be computationally intractable in general.\nProblem statement. Our goal is to understand whether the Bitcoin community\u2019s move from trickle\nspreading to diffusion actually improved the system\u2019s anonymity guarantees. The problem at hand is\nto characterize the maximum-likelihood (ML) probability of detection of the eavesdropper adversary\nfor both trickle and diffusion processes on d-regular trees, as a function of degree d, number of\ncorrupted connections $\\theta$, and detection time t. We meet this goal by computing lower bounds derived\nfrom the analysis of suboptimal estimators (e.g., first-timestamp estimator and centrality-based\nestimators), and upper bounds derived from fundamental limits on detection.\n\nFigure 1: The eavesdropper adversary\nestablishes $\\theta$ links (in red) to each server.\nHonest servers are connected in a d-regular\ntree topology (edges in black).\n\nRelated work. 
Although there has been much work on the anonymity properties of Bitcoin [19,\n28, 24, 27], the \u2018epidemic source finding\u2019 interpretation of Bitcoin deanonymization is fairly new.\nPrior work that (implicitly) adopts this interpretation has focused on Bitcoin\u2019s protocol flaws more\nthan the inference aspect of the problem [6, 15]. As this is the focus of our paper, we include the\nrelated source detection literature. Epidemic source detection has been widely studied under diffusion\nspreading with a snapshot adversary, which observes the set of infected nodes at a single time t; in\nour notation, the adversary would learn the set $\\{v \\in V : X_v \\le t\\}$ (no timestamps), along with graph\nG. Shah and Zaman first characterized the ML probability of detection for diffusion observed by a\nsnapshot adversary when the underlying graph is a regular tree [29]. These results were later extended\nto random, irregular trees [31], whereas other authors studied heuristic source detection methods on\ngeneral graphs [12, 26, 16] and related theoretical limits [32, 21, 14]. The eavesdropper adversary\ndiffers in that it eventually observes a noisy timestamp $\\tau_v$ from every node, regardless of when the\nnode is infected. This changes both the analysis and the estimators that one can use. Another common\nadversarial model is the spy-based adversary, which observes exact timestamps for a corrupted set\nof nodes that does not include the source [25, 34]. In our notation, for a set of spies $S \\subseteq V$, the\nspy-based adversary observes $\\{(s, X_s) : s \\in S\\}$. Prior work on the spy-based adversary does not\ncharacterize the ML probability of detection, but researchers have proposed efficient heuristics that\nperform well in practice [25, 34, 35, 9]. 
Unlike the spy-based adversary, the eavesdropper only\nobserves delayed timestamps, and it does so for all nodes, including the source.\n\n3 Analysis of trickle (pre-2015)\n\n3.1 First-timestamp estimator\n\nThe analysis of trickle propagation is complicated by its combinatorial, time-dependent nature.\nAs such, we lower-bound the first-timestamp estimator\u2019s probability of detection. Let\n$\\tau_m \\triangleq \\min(\\tau_1, \\tau_2, \\ldots)$ denote the minimum observed timestamp among nodes that are not the\nsource. Then we compute $P(\\tau_0 < \\tau_m)$, i.e., the probability that the true source reports the message\nto the adversary strictly before any of the other nodes. This event (which causes the source to be\ndetected with probability 1) does not include cases where the true source is one of k nodes (k > 1)\nthat report the message to the adversary simultaneously, and before any other node in the system.\nNonetheless, for large node degree d, the \u2018simultaneous reporting\u2019 event is rare, so our lower bound\nis close to the empirical probability of detection of the first-timestamp estimator.\n\nTheorem 3.1 (Proof in Appendix C.1) Consider a message that propagates according to trickle\nspreading over a d-regular tree of servers, where each node additionally has $\\theta$ connections to an\neavesdropping adversary. The first-timestamp estimator\u2019s probability of detection at time $t = \\infty$ satisfies\n\n$$P(M_{FT}(\\tau, G) = v^*) \\ge \\frac{\\theta}{d \\log 2} \\left[ \\mathrm{Ei}(2^d \\log \\rho) - \\mathrm{Ei}(\\log \\rho) \\right],$$\n\nwhere $\\rho = \\frac{d-1}{d-1+\\theta}$, and $\\mathrm{Ei}(x) \\triangleq -\\int_{-x}^{\\infty} \\frac{e^{-t}}{t} \\, dt$ denotes the exponential integral.\n\n[Figure 2: First-timestamp estimator accuracy on d-regular trees when $\\theta = 1$. Curves: theoretical lower bound, log(d) / (d log(2)), and simulation. Horizontal axis: tree degree, d; vertical axis: probability of detection.]\n\nWe prove this bound by conditioning on the time at\nwhich the source reports to the adversary. The proof\nthen becomes a combinatorial counting problem. The\nexpression in Theorem 3.1 can be simplified by examining\nits Taylor expansion (see Appendix A). In particular,\nfor the special case of $\\theta = 1$ where the adversary\nestablishes only one connection per server, line\n(5) simplifies to\n\n$$P(M_{FT}(\\tau, G) = v^*) \\approx \\frac{\\log d}{d \\cdot \\log 2} + o\\left(\\frac{\\log d}{d}\\right).$$\n\nThis suggests that the first-timestamp estimator has\na probability of detection that decays to zero asymptotically\nas log(d)/d. Intuitively, the probability of\ndetection should decay to zero, because the higher\nthe degree of the tree, the higher the likelihood that\na node other than the source reports to the adversary\nbefore the source does. Nonetheless, this is only a\nlower bound on the first-timestamp estimator\u2019s probability of\ndetection, so we wish to understand how tight the bound is.\n\nSimulation. To evaluate the lower bound in Theorem 3.1 and its approximation for $\\theta = 1$, we\nsimulate the first-timestamp estimator on regular trees (see footnote 2). Figure 2 illustrates the simulation results\nfor $\\theta = 1$ compared to the approximation above. Each data point is averaged over 5,000 trials.\nEmpirically, the lower bound appears to be tight, especially as d grows. 
Figure 2 suggests a natural\nsolution to improve anonymity in the Bitcoin network: increase the degree of each node to reduce the\nadversary\u2019s probability of detection. However, we shall see in the next section that stronger estimators\n(e.g., the ML estimator) may achieve high probabilities of detection, even for large d.\n\n3.2 Maximum-likelihood estimator\n\nAt any time t, if one knew the ground truth timestamps (i.e., the $X_v$\u2019s), one could arrange the nodes\nof the infected subgraph $G_t$ in the order they received the message. We call such an arrangement an\nordering of nodes. Since propagation is in discrete time, multiple nodes may receive the message\nsimultaneously; such nodes are lumped together in the ordering. Of course, the true ordering is not\nobserved by the adversary, but the observed timestamps (i.e., $\\tau$) restrict the set of possible orderings.\nA feasible ordering is an ordering that respects the rules of trickle propagation over graph G, as well\nas the observed timestamps $\\tau$. In this subsection only, we will abuse notation by using $\\tau$ to refer\nto all timestamps observed by the adversary, not just the first timestamp from each server. So if the\nadversary has $\\theta$ connections to each server, $\\tau$ would include $\\theta$ timestamps per honest server.\nWe propose an estimator called timestamp rumor centrality, which counts the number of feasible\norderings originating from each candidate source. The candidate with the most feasible orderings is\nchosen as the estimator output. This estimator is similar to rumor centrality, an estimator devised for\nsnapshot adversaries in [29]. However, the presence of timestamps and the lack of knowledge of the\ninfected subgraph increase the estimator\u2019s complexity. We first motivate timestamp rumor centrality.\n\nProposition 3.2 (Proof in Appendix C.2) Consider a trickle process over a d-regular graph, where\neach node has $\\theta$ connections to the eavesdropper adversary. 
Any feasible orderings $o_1$ and $o_2$ with\nrespect to observed timestamps $\\tau$ and graph G have the same likelihood.\n\nProposition 3.2 implies that at any fixed time, the likelihood of observing $\\tau$ given a candidate source\nis proportional to the number of feasible orderings originating from that candidate source. Therefore,\nan ML estimator (timestamp rumor centrality) counts the number of feasible orderings at estimation\ntime t. Timestamp rumor centrality is a message-passing algorithm that proceeds as follows: for\neach candidate source, recursively determine the set of feasible times when each node could have\nbeen infected, given the observed timestamps. This is achieved by passing a set of \u201cfeasible times of\nreceipt\u201d from the candidate source to the leaves of the largest feasible infected subtree rooted at the\ncandidate source. In each step, nodes prune receipt times that conflict with their observed timestamps.\nNext, given each node\u2019s set of feasible receipt times, the estimator counts the number of feasible orderings\nthat obey the rules of trickle propagation. This is achieved by passing sets of partial orderings from\nthe leaves to the candidate source, and pruning infeasible orderings. The timestamp rumor centrality\nprotocol is presented in Appendix A.2, along with minor modifications that reduce its complexity.\n\nIn [31], precise analysis of standard rumor centrality\nwas possible because rumor centrality\ncan be reduced to a simple counting problem.\nSuch an analysis is more challenging for timestamp\nrumor centrality, because timestamps prevent\nus from using the same counting argument.\nHowever, we identify a suboptimal, simplified\nversion of timestamp rumor centrality that approaches\noptimal probabilities of detection as t\ngrows. 
We call this estimator ball centrality.\nBall centrality checks whether a candidate\nsource v could have generated each of the observed\ntimestamps, independently. For example, Figure 3 contains a sample spread on a line graph,\nwhere the adversary has one connection per server (not shown). Therefore, d = 2 and $\\theta = 1$. The\nground truth infection time is written as $X_v$ below each node, and the observed timestamps are written\nabove the node. In this figure, the estimator is run at time t = 4, so the adversary only sees three\ntimestamps. For each observed timestamp $\\tau_v$, the estimator creates a ball of radius $\\tau_v - 1$, centered\nat v. For example, in our figure, the green node (node 1) has $\\tau_1 = 2$. Therefore, the adversary would\nmake a ball of radius 1 centered at node 1; this ball is depicted by the green bubble in our figure. The\nball represents the set of nodes that are close enough to node 1 to feasibly report to the adversary\nfrom node 1 at time $\\tau_1 = 2$. After constructing an analogous ball for every observed timestamp in $\\tau$,\nthe protocol outputs a source selected uniformly from the intersection of these balls. In our example,\nthere are exactly two nodes in this intersection. We describe ball centrality precisely in Protocol 1\n(Appendix A.2.1). Although ball centrality is not ML for a fixed time t, the following theorem lower\nbounds the ML probability of detection by analyzing ball centrality and showing that its probability\nof detection approaches a fundamental upper bound exponentially fast in detection time t.\n\nFigure 3: Example of ball centrality on a line with\none link to the adversary per server (these links are\nnot shown). The estimator is run at time t = 4.\n\nFootnote 2: Code for all simulations available at https://github.com/gfanti/bitcoin-trickle-diffusion.\n\nTheorem 3.3 (Proof in Section C.3) Consider a trickle spreading process over a d-regular graph of\nhonest servers. 
In addition, each server has $\\theta$ independent connections to an eavesdropper adversary.\nThe ML probability of detection at time t satisfies the following expression:\n\n$$1 - \\frac{d}{2(\\theta + d)} - \\left(\\frac{d}{\\theta + d}\\right)^t \\overset{(a)}{\\le} P(M_{ML}(\\tau, G) = v^*) \\overset{(b)}{\\le} 1 - \\frac{d}{2(\\theta + d)} \\qquad (1)$$\n\nNote that the right-hand side of equation (1) is always greater than $\\frac{1}{2}$. As such, increasing the graph\ndegree would not significantly reduce the probability of detection; the adversary can still identify\nthe source with probability at least $\\frac{1}{2}$ given enough time. Second, the ML probability of detection\napproaches its upper bound exponentially fast in time t. This suggests that the adversary can achieve\nhigh probabilities of detection at small times t. These results highlight an important point: estimators\nthat exploit graph structure can reap significant, order-level gains in accuracy.\n\n4 Analysis of diffusion (post-2015)\n\n4.1 First-timestamp estimator\n\nAlthough the first-timestamp estimator does not use knowledge of the underlying graph, its performance\ndepends on the underlying graph structure. The following theorem exactly characterizes its\nprobability of detection on a regular tree as $t \\to \\infty$.\n\nTheorem 4.1 (Proof in Appendix C.4) Consider a diffusion process of rate $\\lambda = 1$ over a d-regular\ntree, d > 2. Suppose an adversary observes each node\u2019s infection time with an independent,\nexponential delay of rate $\\lambda_2 = \\theta$, $\\theta \\ge 1$. Then the following expression describes the probability of\ndetection for the first-timestamp estimator at time $t = \\infty$:\n\n$$P(M_{FT}(\\tau, G) = v^*) = \\frac{\\theta}{d-2} \\log\\left(\\frac{d+\\theta-2}{\\theta}\\right).$$\n\nThe proof expresses the probability of detection as a nonlinear differential equation that can be\nsolved exactly. The expression highlights a few points: First, for a fixed degree d, the probability of\ndetection is strictly positive as $t \\to \\infty$. 
This is straightforward to see, but under other adversarial\nmodels (e.g., snapshot adversaries) it is not trivial to see that the probability of detection is positive as\n$t \\to \\infty$. Indeed, several papers are dedicated to making that point [30, 31]. Second, when $\\theta = 1$, i.e.,\nthe adversary has only one connection per node, the probability of detection approaches log(d)/d\nasymptotically in d. This quantity tends to 0 as $d \\to \\infty$, and it is order-equal to the probability of\ndetection of the first-timestamp adversary on the trickle protocol when $\\theta = 1$ (see Section 3.1).\nTheorem 4.1 suggests that the Bitcoin community\u2019s transition from trickle spreading to diffusion does\nnot provide order-level anonymity gains (asymptotically in the degree of the graph), at least for the\nfirst-timestamp adversary. Next, we ask if the same is true for estimators that use the graph structure.\n\n4.2 Centrality-based estimators\n\nWe compute a different lower bound on the ML probability of detection by analyzing a centrality-based\nestimator. Unlike the first-timestamp estimator, this reporting centrality estimator uses the\nstructure of the infected subgraph by selecting a candidate source that is close to the center (on the\ngraph) of the observed timestamps. However, it does not explicitly use the observed timestamps.\nAlso unlike the first-timestamp estimator, this centrality-based estimator improves as the degree d\nof the underlying tree increases, with a strictly positive probability of detection as $d \\to \\infty$. Thus\nthe eavesdropper adversary has an ML probability of detection that scales as $\\Theta(1)$ in d. Intuitively,\nreporting centrality works as follows: for each candidate source v, the estimator counts the number\nof nodes that have reported to the adversary from each of node v\u2019s adjacent subtrees. 
It picks a\ncandidate source for which the number of reporting nodes is approximately equal in each subtree.\nTo make this precise, suppose the infected subtree $G_t$ is rooted at w; we use $T_v^w$ to denote the subtree\nof $G_t$ that contains v and all of v\u2019s descendants, with respect to root node w. Consider a random\nvariable $Y_v(t)$, which is 1 if node $v \\in V$ has reported to the adversary by time t, and 0 otherwise. We\nlet $Y_{T_v^w}(t) = \\sum_{u \\in T_v^w} Y_u(t)$ denote the number of nodes in $T_v^w$ that have reported to the adversary\nby time t. We use $Y(t) = \\sum_{v \\in V_t} Y_v(t)$ to denote the total number of reporting nodes in $G_t$ at time t.\nSimilarly, we use $N_{T_v^w}(t)$ to denote the number of infected nodes in $T_v^w$ (so $N_{T_v^w}(t) \\ge Y_{T_v^w}(t)$), and\nwe let N(t) denote the total number of infected nodes at time t ($N(t) \\ge Y(t)$). For each candidate\nsource v, we consider its d neighbors, which comprise the set $\\mathcal{N}(v)$. We define a node v\u2019s reporting\ncentrality at time t\u2014denoted $R_v(t)$\u2014as follows:\n\n$$R_v(t) = \\begin{cases} 1 & \\text{if } \\max_{u \\in \\mathcal{N}(v)} Y_{T_u^v}(t) < \\frac{Y(t)}{2} \\\\ 0 & \\text{otherwise.} \\end{cases} \\qquad (2)$$\n\nThat is, a node\u2019s reporting centrality is 1 iff each of its adjacent subtrees has fewer than\nY(t)/2 reporting nodes. A node is a reporting center iff its reporting centrality is 1.\nThe estimator outputs $\\hat{v}$ chosen uniformly from all reporting\ncenters. In Figure 4, where Y(t) = 5 and N(t) = 7, $v^*$ is the only reporting center.\nReporting centrality does not use the adversary\u2019s observed\ntimestamps\u2014it only counts the number of reporting nodes\nin each of a node\u2019s adjacent subtrees. 
This estimator is\ninspired by rumor centrality [30], an ML estimator for the\nsource of a diffusion process under a snapshot adversary.\nRecall that a snapshot adversary sees the infected subgraph\n$G_t$ at time t, but it does not learn timestamp information.\nThe next theorem shows that for trees with high degree d,\nreporting centrality has a strictly higher (in an order sense)\nprobability of detection than the first-timestamp estimator;\nits probability of detection is strictly positive as $d \\to \\infty$.\n\nFigure 4: Yellow nodes are infected; a\nred outline means the node has reported.\n$R_{v^*}(t) = 1$ since $v^*$\u2019s adjacent subtrees\nhave $\\le Y(t)/2 = 2.5$ reporting nodes.\n\nTheorem 4.2 (Proof in Section C.5) Consider a diffusion\nprocess of rate $\\lambda = 1$ over a d-regular tree. Suppose\nthis process is observed by an eavesdropper adversary, which sees each node\u2019s timestamp\nwith an independent exponential delay of rate $\\lambda_2 = \\theta$, $\\theta \\ge 1$. Then the reporting centrality\nestimator has a (time-dependent) probability of detection $P(M_{RC}(\\tau, G) = v^*)$ that satisfies\n\n$$\\liminf_{t \\to \\infty} P(M_{RC}(\\tau, G) = v^*) \\ge C_d > 0,$$\n\nwhere $C_d = 1 - d\\left(1 - I_{1/2}\\left(\\frac{1}{d-2}, 1 + \\frac{1}{d-2}\\right)\\right)$ is a\nconstant that depends only on degree d, and $I_{1/2}(a, b)$ is the regularized incomplete Beta function,\ni.e., the probability a Beta random variable with parameters a and b takes a value in $[0, \\frac{1}{2}]$.\n\nTo prove this, we relate two P\u00f3lya urn processes: one that represents the diffusion process over the\nregular tree of honest nodes, and one that describes the full spreading process, which includes both\ndiffusion over the regular tree and random reporting to the adversary. The first urn can be posed as a\nclassic P\u00f3lya urn [10], which has been studied in the context of diffusion [31, 14]. 
The second urn can be described by an unbalanced generalized P\u00f3lya urn (GPU) with negative coefficients [4, 13], a class of urns that does not typically appear in the study of diffusion (to the best of our knowledge). As a side note, this approach can be used to analyze other epidemic source-finding problems that have previously evaded analysis, as we show in Appendix B. Notice that the constant $C_d$ in Theorem 4.2 does not depend on $\\theta$. This is because the reporting centrality estimator makes no use of timestamp information, so the delays in the timestamps $\\tau$ do not affect the estimator\u2019s asymptotic behavior.\n\nSimulation results. To evaluate the lower bound in Theorem 4.2, we simulate reporting centrality on diffusion over regular trees. Figure 5 illustrates the empirical performance of reporting centrality averaged over 4,000 trials, compared to the theoretical lower bound on the liminf. The estimator is run at time $t = d + 2$. Our simulations are run up to degree $d = 5$ due to computational constraints, since the infected subgraph grows exponentially in the degree of the tree. By $d = 5$, reporting centrality reaches the theoretical lower bound on the limiting detection probability.\n\nFor diffusion, neither the first-timestamp lower bound nor the reporting centrality lower bound strictly outperforms the other. Figure 5 compares the two estimators as a function of degree $d$. We observe that reporting centrality outstrips first-timestamp estimation for trees of degree 9 and higher; since our theoretical result is only a lower bound on the performance of reporting centrality, the transition may occur at even smaller $d$. Empirically, the true Bitcoin graph is approximately 8-regular [20], a regime in which we expect reporting centrality to perform similarly to the first-timestamp estimator.\n\nFigure 5: First-timestamp vs. reporting centrality on diffusion over regular trees, theoretically and simulated. $\\theta = 1$, $t = d + 2$. [Plot: probability of detection vs. degree $d$. Curves: First-timestamp, theoretical; First-timestamp, simulated; Reporting centrality, theoretical; Reporting centrality, simulated.]\n\nFigure 6: Comparison of trickle and diffusion under the first-timestamp estimator on 4-regular trees. [Plot: probability of detection vs. eavesdropper connections $\\theta$. Curves: Trickle, theoretical (lower bound); Trickle, simulated (lower bound); Trickle, simulated (exact); Diffusion, theoretical; Diffusion, simulated.]\n\nFigure 7: Trickle vs. diffusion under the first-timestamp estimator, simulated on a snapshot of the real Bitcoin network [20]. [Plot: probability of detection vs. eavesdropper connections $\\theta$. Curves: Trickle, theoretical lower bound; Trickle, simulated; Trickle, theoretical lower bound (d=2); Diffusion, theoretical; Diffusion, simulation.]\n\nTable 1: Probability of detection on a $d$-regular tree. The adversary has $\\theta$ connections per server.\n\nEstimator | $\\theta$ | Trickle | Diffusion\nFirst-timestamp | All $\\theta$ | $\\frac{\\theta}{d \\log 2}\\left[\\mathrm{Ei}(2^d \\log \\rho) - \\mathrm{Ei}(\\log \\rho)\\right]$ (Thm. 3.1) | $\\frac{\\theta}{d-2} \\log\\left(\\frac{d+\\theta-2}{\\theta}\\right)$ (Thm. 4.1)\nFirst-timestamp | $\\theta = 1$ | $\\frac{\\log d}{d \\log 2} + o\\left(\\frac{\\log d}{d}\\right)$ (Sec. 3.1) | $\\frac{\\log(d-1)}{d-2}$ (Thm. 4.1)\nMaximum-likelihood | All $\\theta$ | $1 - \\frac{d}{2(\\theta+d)}$ (Thm. 3.3) | $1 - d\\left(1 - I_{1/2}\\left(\\frac{1}{d-2}, 1 + \\frac{1}{d-2}\\right)\\right)$ (Thm. 4.2)\nMaximum-likelihood | $\\theta = 1$ | $1 - \\frac{d}{2(d+1)}$ (Thm. 3.3) | (as for all $\\theta$)\n\n5 Discussion\n\nTable 1 summarizes our theoretical results for trickle and diffusion. The probabilities of detection for trickle and diffusion are similar, particularly when $\\theta = 1$. Although the maximum-likelihood results are difficult to compare visually, they both approach a positive constant as $d, t \\to \\infty$; for trickle propagation, that constant is $\\frac{1}{2}$, whereas for diffusion, it is approximately 0.307.\n\nThese results are asymptotic in degree $d$. In practice, the underlying Bitcoin graph is fixed; the only variable quantity is the adversary\u2019s resources, represented by $\\theta$. Figure 6 compares analytical expressions and simulations for 4-regular trees under the first-timestamp estimator (as we lack an ML estimator on general graphs), as a function of $\\theta$. It suggests nearly identical detection probabilities for diffusion and trickle on regular trees; while our theoretical prediction for diffusion is exact, our lower bound on trickle is loose since $d$ is small.\n\nTo validate our decision to analyze regular trees, we simulate trickle and diffusion on a 2015 snapshot of the Bitcoin network [20]. Figure 7 compares these results as a function of $\\theta$, for the first-timestamp estimator. Unless specified otherwise, theoretical curves are calculated for a regular tree with $d = 8$, the mean degree of our dataset. Diffusion performs close to the theoretical prediction; this is because with high probability, the first-timestamp estimator uses only a local neighborhood to estimate $v^*$, and the Bitcoin graph is locally tree-like. However, our trickle lower bound remains loose. This is partially due to simultaneous reporting events, but the main contributing factor seems to be graph irregularity. Understanding this effect more carefully is an interesting question for future work.\n\nIn summary, trickle and diffusion have similar probabilities of detection, both in an asymptotic-order sense and numerically. We have analyzed the canonical class of $d$-regular trees and simulated these protocols on a real Bitcoin graph topology. 
Our results omit certain details of the spreading protocols (Sec. 2); extending the analysis to include these details is practically relevant.\n\nReferences\n\n[1] Bitcoin core commit 5400ef6. https://github.com/bitcoin/bitcoin/commit/5400ef6bcb9d243b2b21697775aa6491115420f3.\n\n[2] Bitcoin core integration/staging tree. https://github.com/bitcoin/bitcoin.\n\n[3] Elli Androulaki, Ghassan O Karame, Marc Roeschlin, Tobias Scherer, and Srdjan Capkun. Evaluating user privacy in bitcoin. In International Conference on Financial Cryptography and Data Security, pages 34\u201351. Springer, 2013.\n\n[4] Krishna B Athreya and Peter E Ney. Branching processes, volume 196. Springer Science & Business Media, 2012.\n\n[5] Carl M Bender and Steven A Orszag. Advanced mathematical methods for scientists and engineers I. Springer Science & Business Media, 1999.\n\n[6] Alex Biryukov, Dmitry Khovratovich, and Ivan Pustogarov. Deanonymisation of clients in bitcoin p2p network. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 15\u201329. ACM, 2014.\n\n[7] Alex Biryukov and Ivan Pustogarov. Bitcoin over tor isn\u2019t a good idea. In 2015 IEEE Symposium on Security and Privacy, pages 122\u2013134. IEEE, 2015.\n\n[8] Arnon Boneh and Micha Hofri. The coupon-collector problem revisited - a survey of engineering problems and computational methods. Stochastic Models, 13(1):39\u201366, 1997.\n\n[9] Zhen Chen, Kai Zhu, and Lei Ying. Detecting multiple information sources in networks under the sir model. IEEE Transactions on Network Science and Engineering, 3(1):17\u201331, 2016.\n\n[10] Florian Eggenberger and George P\u00f3lya. \u00dcber die statistik verketteter vorg\u00e4nge. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift f\u00fcr Angewandte Mathematik und Mechanik, 3(4):279\u2013289, 1923.\n\n[11] G. Fanti, P. Kairouz, S. Oh, K. Ramchandran, and P. Viswanath. Metadata-aware anonymous messaging. 
In ICML, 2015.\n\n[12] V. Fioriti and M. Chinnici. Predicting the sources of an outbreak with a spectral technique. arXiv:1211.2333, 2012.\n\n[13] Svante Janson. Functional limit theorems for multitype branching processes and generalized p\u00f3lya urns. Stochastic Processes and their Applications, 110(2):177\u2013245, 2004.\n\n[14] Justin Khim and Po-Ling Loh. Confidence sets for the source of a diffusion in regular trees. arXiv preprint arXiv:1510.05461, 2015.\n\n[15] Philip Koshy, Diana Koshy, and Patrick McDaniel. An analysis of anonymity in bitcoin using p2p network traffic. In International Conference on Financial Cryptography and Data Security, pages 469\u2013485. Springer, 2014.\n\n[16] A. Y. Lokhov, M. M\u00e9zard, H. Ohta, and L. Zdeborov\u00e1. Inferring the origin of an epidemic with dynamic message-passing algorithm. arXiv preprint arXiv:1303.5315, 2013.\n\n[17] Paul Mah. Top 5 vpn services for personal privacy and security, 2016. http://www.cio.com/article/3152904/security/top-5-vpn-services-for-personal-privacy-and-security.html.\n\n[18] Hosam Mahmoud. P\u00f3lya urn models. CRC press, 2008.\n\n[19] Sarah Meiklejohn, Marjori Pomarole, Grant Jordan, Kirill Levchenko, Damon McCoy, Geoffrey M Voelker, and Stefan Savage. A fistful of bitcoins: characterizing payments among men with no names. In Proceedings of the 2013 conference on Internet measurement conference, pages 127\u2013140. ACM, 2013.\n\n[20] Andrew Miller, James Litton, Andrew Pachulski, Neal Gupta, Dave Levin, Neil Spring, and Bobby Bhattacharjee. Discovering bitcoin\u2019s public topology and influential nodes, 2015.\n\n[21] Chris Milling, Constantine Caramanis, Shie Mannor, and Sanjay Shakkottai. Network forensics: random infection vs spreading epidemic. ACM SIGMETRICS Performance Evaluation Review, 40(1):223\u2013234, 2012.\n\n[22] David Z. Morris. 
Legal sparring continues in bitcoin user\u2019s battle with irs tax sweep, 2017. http://fortune.com/2017/01/01/bitcoin-irs-tax-sweep-user-battle/.\n\n[23] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system, 2008.\n\n[24] Micha Ober, Stefan Katzenbeisser, and Kay Hamacher. Structure and anonymity of the bitcoin transaction graph. Future internet, 5(2):237\u2013250, 2013.\n\n[25] P. C. Pinto, P. Thiran, and M. Vetterli. Locating the source of diffusion in large-scale networks. Physical review letters, 109(6):068702, 2012.\n\n[26] B. A. Prakash, J. Vreeken, and C. Faloutsos. Spotting culprits in epidemics: How many and which ones? In ICDM, volume 12, pages 11\u201320, 2012.\n\n[27] Fergal Reid and Martin Harrigan. An analysis of anonymity in the bitcoin system. In Security and privacy in social networks, pages 197\u2013223. Springer, 2013.\n\n[28] Dorit Ron and Adi Shamir. Quantitative analysis of the full bitcoin transaction graph. In International Conference on Financial Cryptography and Data Security, pages 6\u201324. Springer, 2013.\n\n[29] D. Shah and T. Zaman. Detecting sources of computer viruses in networks: theory and experiment. In ACM SIGMETRICS Performance Evaluation Review, volume 38, pages 203\u2013214. ACM, 2010.\n\n[30] D. Shah and T. Zaman. Rumors in a network: Who\u2019s the culprit? Information Theory, IEEE Transactions on, 57:5163\u20135181, Aug 2011.\n\n[31] D. Shah and T. Zaman. Rumor centrality: a universal source detector. In ACM SIGMETRICS Performance Evaluation Review, volume 40, pages 199\u2013210. ACM, 2012.\n\n[32] Z. Wang, W. Dong, W. Zhang, and C.W. Tan. Rumor source detection with multiple observations: Fundamental limits and algorithms. In ACM SIGMETRICS, 2014.\n\n[33] Eric W Weisstein. Euler-mascheroni constant. 2002.\n\n[34] K. Zhu and L. Ying. A robust information source estimator with sparse observations. arXiv preprint arXiv:1309.4846, 2013.\n\n[35] Kai Zhu and Lei Ying. 
A robust information source estimator with sparse observations. Computational Social Networks, 1(1):1, 2014.", "award": [], "sourceid": 882, "authors": [{"given_name": "Giulia", "family_name": "Fanti", "institution": "Carnegie Mellon University"}, {"given_name": "Pramod", "family_name": "Viswanath", "institution": "UIUC"}]}