{"title": "(Nearly) Efficient Algorithms for the Graph Matching Problem on Correlated Random Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 9190, "page_last": 9198, "abstract": "We consider the graph matching/similarity problem of determining how similar two given graphs $G_0,G_1$ are and recovering the permutation $\\pi$ on the vertices of $G_1$ that minimizes the symmetric difference between the edges of $G_0$ and $\\pi(G_1)$. Graph matching/similarity has applications for pattern matching, vision, social network anonymization, malware analysis, and more. We give the first efficient algorithms proven to succeed in the correlated Erd\u00f6s-R\u00e9nyi model (Pedarsani and Grossglauser, 2011). Specifically, we give a polynomial time algorithm for the graph similarity/hypothesis testing task which works for every constant level of correlation between the two graphs that can be arbitrarily close to zero. We also give a quasi-polynomial ($n^{O(\\log n)}$ time) algorithm for the graph matching task of recovering the permutation minimizing the symmetric difference in this model. This is the first algorithm to do so without requiring as additional input a ``seed'' of the values of the ground truth permutation on at least $n^{\\Omega(1)}$ vertices. 
Our algorithms follow a general framework of counting the occurrences of subgraphs from a particular family of graphs, allowing for tradeoffs between efficiency and accuracy.", "full_text": "(Nearly) Efficient Algorithms for the Graph Matching Problem on Correlated Random Graphs

Boaz Barak* (b@boazbarak.org), Chi-Ning Chou* (chiningchou@g.harvard.edu), Zhixian Lei* (leizhixian.research@gmail.com), Tselil Schramm* (tselil@seas.harvard.edu), Yueqi Sheng* (ysheng@g.harvard.edu)
School of Engineering and Applied Science, Harvard University, Cambridge, MA 02138

Abstract

We consider the graph matching/similarity problem of determining how similar two given graphs G0, G1 are and recovering the permutation π on the vertices of G1 that minimizes the symmetric difference between the edges of G0 and π(G1). Graph matching/similarity has applications in pattern matching, computer vision, social network anonymization, malware analysis, and more. We give the first efficient algorithms proven to succeed in the correlated Erdős-Rényi model (Pedarsani and Grossglauser, 2011). Specifically, we give a polynomial time algorithm for the graph similarity/hypothesis testing task which works for every constant level of correlation between the two graphs, which can be arbitrarily close to zero. We also give a quasi-polynomial (n^{O(log n)} time) algorithm for the graph matching task of recovering the permutation minimizing the symmetric difference in this model.
This is the first algorithm to do so without requiring as additional input a "seed" of the values of the ground truth permutation on at least n^{Ω(1)} vertices. Our algorithms follow a general framework of counting the occurrences of subgraphs from a particular family of graphs, allowing for tradeoffs between efficiency and accuracy.

1 Introduction

The graph matching and graph similarity problems are well-studied computational problems with applications in a great many areas. Some examples include machine learning [1], computer vision [2], pattern recognition [3], computational biology [4, 5], social network analysis [6], de-anonymization [7], and malware detection [8].2 The graph matching problem is the task of computing, given a pair (G0, G1) of n-vertex graphs, the permutation

    π* = arg min_{π ∈ S_n} ‖G0 − π(G1)‖_0    (1)

where we identify the graphs with their adjacency matrices, and write π(G1) for the matrix obtained by permuting the rows and columns according to π (i.e., the matrix P^T G1 P where P is the permutation matrix corresponding to π). The graph similarity problem is to merely determine whether or not G0 is similar to G1, or more generally to obtain an efficiently computable distance measure on G0 and G1 that provides a rough approximation to min_{π ∈ S_n} ‖G0 − π(G1)‖_0.

In this paper we obtain new algorithms with provable guarantees for both problems. These problems are NP-hard in the worst case3 and hence our focus is on average-case complexity, and specifically the correlated Erdős-Rényi model introduced by [11] and studied in [6, 12, 13, 14, 15, 16].

*Supported by NSF awards CCF 1565264 and CNS 1618026.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
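Concretely, the objective in (1), restricted to unordered vertex pairs, is the size of the symmetric difference between the edge set of G0 and the permuted edge set of G1 (the paper's ‖·‖0 counts adjacency-matrix entries, i.e., twice the pair count). A minimal sketch, with function names of our own choosing for illustration:

```python
import itertools

def permute_graph(edges, perm):
    """Apply a vertex relabeling to an edge set; perm[old_label] = new_label."""
    return {frozenset((perm[u], perm[v])) for u, v in edges}

def hamming_objective(edges0, edges1, perm):
    """Number of unordered vertex pairs on which G0 and perm(G1) disagree,
    i.e. the size of the symmetric difference of the two edge sets."""
    e0 = {frozenset(e) for e in edges0}
    return len(e0 ^ permute_graph(edges1, perm))

def brute_force_match(n, edges0, edges1):
    """Exhaustive search over S_n, the information-theoretic baseline (O(n!))."""
    return min(itertools.permutations(range(n)),
               key=lambda p: hamming_objective(edges0, edges1, p))
```

For example, matching a triangle on {0,1,2} against a relabeled triangle on {1,2,3} recovers a permutation with objective value zero.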
For n a positive integer and 0 < p, γ < 1, the correlated Erdős-Rényi model with parameters n, p, γ is the following distribution over triples (G0, G1, π) where G0, G1 are n-vertex graphs and π is a permutation on [n]: (i) We sample a base graph B from the Erdős-Rényi random graph distribution G(n, p), (ii) We let G, G′ be two independent random subgraphs of B, where every edge from B is kept in G and G′ with probability γ independently, (iii) We choose a random permutation π and output (G, π(G′), π).4 We denote this distribution by Dstruct(n, p; γ). We say that (G0, G1) are sampled from Dstruct(n, p; γ) if they are obtained by sampling (G0, G1, π) from this distribution and discarding the permutation π. We use Dnull(n, p; γ) for the product distribution G(n, pγ) × G(n, pγ). Note that the marginals over G0, G1 are the same in both Dstruct and Dnull, but the two graphs are correlated in the former distribution and independent in the latter. We consider the following two computational problems:

Graph similarity: hypothesis testing. Given (G0, G1) sampled either from Dstruct(n, p; γ) or from Dnull(n, p; γ), the goal is to distinguish which distribution the input (G0, G1) was sampled from. Graph similarity (for general models) has been proposed as a tool for malware detection [17, 18], chemical structure similarity [19, 20], comparing biological networks [21] and more.

Graph matching: recovery. Given (G0, G1) sampled from Dstruct(n, p; γ), the goal is to recover the permutation π.
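Steps (i)-(iii) above translate directly into a sampler. The following sketch is our own illustrative code (graphs are represented as sets of vertex pairs with u < v):

```python
import random

def erdos_renyi(n, p, rng):
    """Edge set of a sample from the Erdos-Renyi distribution G(n, p)."""
    return {(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p}

def sample_correlated(n, p, gamma, rng):
    """Sample (G0, G1, pi) from D_struct(n, p; gamma):
    (i) base graph B ~ G(n, p); (ii) keep each edge of B in G and G'
    independently with probability gamma; (iii) relabel G' by a random pi."""
    base = erdos_renyi(n, p, rng)
    g0 = {e for e in base if rng.random() < gamma}
    g1 = {e for e in base if rng.random() < gamma}
    pi = list(range(n))
    rng.shuffle(pi)
    g1_perm = {tuple(sorted((pi[u], pi[v]))) for u, v in g1}
    return g0, g1_perm, pi

def sample_null(n, p, gamma, rng):
    """D_null(n, p; gamma): two independent samples from G(n, p*gamma),
    matching the marginals of D_struct."""
    return erdos_renyi(n, p * gamma, rng), erdos_renyi(n, p * gamma, rng)
```

Marginally, each edge of G0 (or G1) is present with probability pγ in both distributions; only the joint distribution differs.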
Graph matching has a long history in pattern recognition [3], social network de-anonymization [7] and more.

1.1 Our contributions

It is known that as long as pγ² ≫ log n/n, if (G0, G1, π) is drawn from Dstruct(n, p; γ) then π will be the minimizer of the right-hand side of (1), but prior to this work it was not known whether there is an efficient algorithm to recover π (see Section 1.2 for related work). In this work we give algorithms for both the hypothesis testing and recovery problems in the correlated Erdős-Rényi model Dstruct(n, p; γ) for every constant (and even slightly sub-constant) γ and a wide range of p.

Theorem 1.1 (Hypothesis testing). For every ε > 0, sufficiently small δ > 0, and γ > 0 there is a polynomial time algorithm A that distinguishes with success probability at least 1 − ε between the case that (G0, G1) are sampled from Dstruct(n, n^{δ−1}; γ) and the case that they are sampled from Dnull(n, n^{δ−1}; γ).

Theorem 1.2 (Recovery). For every ε > 0, sufficiently small δ > 0, and γ > 0, there is a randomized algorithm A with running time n^{O(log n)} such that with probability at least 1 − ε over (G0, G1, π*) ∼ Dstruct(n, n^{δ−1}; γ) and over the choices of A, we have A(G0, G1) = π*.

These are the first algorithms that run in better than exponential time for these problems (see Table 1). While the main contribution of this paper is theoretical, we believe that our techniques are of independent interest and applicability beyond the correlated Erdős-Rényi model.
Key to our work is the notion of identifying a large family of subgraphs (a "flock of black swans"), each of which is highly unlikely to occur as a subgraph in a random graph, but which satisfies some near-independence conditions implying that with high probability some members of the family will occur. The existence of such a family is by no means easy to establish: showing this accounts for much of the technical work in this paper, and there are still ranges of parameters for which we conjecture that such families exist but have not been able to prove so. However, for any given distribution of graphs, one can search for subgraphs that will serve as useful features for both graph similarity and graph matching.

2See the surveys [9, 10], the latter of which is titled "Thirty Years of Graph Matching in Pattern Recognition".
3Hamiltonian path is NP hard and can be reduced to graph matching by matching the input with a cycle.
4Some works also studied a more general variant where G0 and G1 use different subsampling parameters γ0, γ1. Our work extends to this setting as well, but for simplicity we focus on the γ0 = γ1 case.

Paper | Algorithm | Runtime
Cullina & Kiyavash [15, 16] | exhaustive search (information theoretic bound) | O(n!)
Yartseva & Grossglauser [12] | percolation from seed set | exp(n^{1−δ−Θ(δ²)})
This paper | subgraph matching | n^{O(1)} for testing; n^{O(log n)} for recovery

Table 1: Comparison with prior algorithms rigorously analyzed for recovery or testing in the correlated Erdős-Rényi model, when (G0, G1, π) ∼ Dstruct(n, n^{δ−1}; γ) for δ > 0. Prior algorithms were analyzed in this model for the recovery task, which subsumes testing. See the related work section for a full discussion.

Remark 1.3.
While we state our results for "sufficiently small" δ, they actually hold in a broader setting (i.e., for 0 < δ ≤ 1/153 or 2/3 ≤ δ < 1). Under a certain combinatorial conjecture our algorithms work for all 0 < δ < 1; see the supplementary material.

1.2 Related work

There has been a significant amount of work on the correlated Erdős-Rényi model. Cullina and Kiyavash [15, 16] precisely characterized the parameters p, γ for which information theoretic recovery is possible. Specifically, they showed recovery is possible (in the information-theoretic sense, via an exhaustive search over all permutations) if pγ² > (log n + ω(1))/n and impossible when pγ² < (log n − ω(1))/n.

Yartseva and Grossglauser [12] analyzed a simple algorithm known as Percolation Graph Matching (PGM), which was used successfully by Narayanan and Shmatikov [7] to de-anonymize many real-world networks. (Similar algorithms were also analyzed by [6, 14, 13].) This algorithm starts with a "seed set" S of vertices in G0 that are mapped by π to G1, and for which the mapping π|S is given. It propagates this information according to a simple percolation, until it recovers the original permutation. Yartseva and Grossglauser gave a precise characterization of the size of the seed set required as a function of p and γ [12]. Specifically, in the case that γ = Ω(1) and p = n^{−1+δ} (where the expected degree of G0 and G1 is Θ(n^δ)), the size of the seed set required is |S| = n^{1−δ−Θ(δ²)}. In the general setting when one is not given such a seed set, we would require about n^{|S|} steps to obtain it by brute force, which yields an exp(n^{Ω(1)}) time algorithm in this regime. Lyzinski et al.
[22] also gave negative results for popular convex relaxations for graph matching on random correlated graphs.

We use a variant of the PGM algorithm as a component in our work to "boost" an initial partial permutation into full knowledge of the ground truth. As part of that, we extend the analysis of PGM to show it works even in the case where the partial assignment is noisy and the seed set itself might not be random but rather can be adversarially chosen; see Lemma 4.2.

There have been many works on heuristics for both graph matching and graph similarity (see the surveys [9, 10]). In particular, [23, 24, 21, 25, 26] studied the graph similarity problem of deciding whether two graphs are similar to one another. [27, 28, 29, 30] trained deep neural networks to extract features of graphs for graph similarity.

2 Approaches and Techniques

In this section, we illustrate our approach and techniques. For simplicity and concreteness, we set the noise parameter γ to one half, and focus on the hypothesis testing task of distinguishing whether graphs (G0, G1) are sampled from Dnull(n, n^{δ−1}; 1/2) or Dstruct(n, n^{δ−1}; 1/2) for some small constant δ > 0.

Warm-up: degree sequence. Since graph matching is a noisy version of graph isomorphism, as a warm-up let us consider one of the most common heuristics for graph isomorphism, which measures similarity of the graphs using their degree sequence: namely, using the vector of sorted degrees of the vertices in the graph as a feature vector. If G0 and G1 were isomorphic then the two vectors would be identical, while for two independent graphs the vectors are highly likely to differ.
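As a sanity check on this warm-up heuristic, the sorted degree sequence is indeed invariant under vertex relabeling. A short illustrative sketch (our own code, not from the paper):

```python
from collections import Counter

def degree_sequence(n, edges):
    """Sorted degree vector of a graph given as an edge set: the classical
    graph-isomorphism feature, identical for isomorphic graphs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg.get(i, 0) for i in range(n))
```

A triangle plus an isolated vertex has degree sequence [0, 2, 2, 2] under any relabeling; the discussion below explains why this feature nevertheless breaks down once independent edge noise is added.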
While this heuristic is quite successful in the setting of (noiseless) graph isomorphism for obtaining at least an initial assignment, it completely fails in our noisy setting of the graph matching and similarity problems. Intuitively, this is due to the fact that degrees in a random graph are highly concentrated (generally of the form pn ± O(√(pn))), and so even adding a small constant amount of noise will have a large effect on the order of the vertices in the sorting, hence making corresponding coordinates of the two vectors independent from one another. A similar phenomenon holds for the case where we use the sorted top eigenvectors of the adjacency matrix as a feature vector. While degrees and eigenvectors are poorly suited for handling noisy graphs, it turns out we can design better features by looking at subgraph counts for carefully chosen families of graphs. This is what we do.

2.1 The "black swan" approach

Our approach can be viewed as "using a flock of black swans". Specifically, we map the graphs G0, G1 into a pair of feature vectors v0, v1 ∈ Z^k as follows: Let H = {H1, . . . , Hk} be a carefully chosen family of small graphs. Next, for b ∈ {0, 1} and j ∈ {1, 2, . . .
, k}, define the jth coordinate of vb to be the number of occurrences of the graph Hj as a subgraph of Gb.5 We choose the family H to satisfy the following two conditions:

"Black swan": For every H ∈ H, the probability that H occurs as a subgraph of a random graph G from G(n, p) is a small number μ ≪ 1.

Pairwise independence (informal): For H ≠ H′ in H, the probability that H and H′ both occur as subgraphs in a random graph G from G(n, p) is, up to a constant factor, the product of the probabilities that each one of them occurs individually.

Before going into the details of the technical properties of black swans, let us first take a look at why this would be useful for the hypothesis testing problem. Let's assume for simplicity that all the graphs in H have e edges for some constant e. If G0, G1 are γ-correlated then for every j ∈ {1, 2, . . . , k}, the coordinates v0_j and v1_j will have a correlation of γ^{2e}. In contrast, if G0, G1 are independently chosen then v0 and v1 are completely independent, and hence v0_j and v1_j have zero correlation. The number γ^{2e} is very small, but the pairwise independence condition implies that if the size |H| of the family is much larger than (1/γ)^{2e} then the vectors v0 and v1 will have a significantly larger inner product in the correlated case than they do in the null case. We instantiate the above idea into a hypothesis testing algorithm in Section 3.

Remark 2.1 (Black-swan based algorithm for recovery). The above approach can be extended to the recovery problem as well. The idea is that for every vertex i of Gb we define a vector vb,i ∈ Z^k such that for all ℓ ∈ [k], vb,i_ℓ is equal to the number of subgraphs of Gb isomorphic to H_ℓ that touch the vertex i.
The intuition is that for vertices i of G0 and j of G1, the vectors v0,i and v1,j are much more likely to have a significant inner product if π(i) = j; this can be used to obtain partial information on the permutation that can later be "boosted" to recover the full permutation. We instantiate the above idea into a recovery algorithm in Section 4.

2.2 Constructing the black swan family

We now describe more precisely the properties that our family H of "black swans" or test graphs needs to satisfy so that the above algorithm will succeed. It is encapsulated in the following theorem:6

Theorem 2.2 (General overview of test graph properties). For any rational scalar d ∈ (2, 2 + 1/76) or d ∈ Z≥3 or d ≥ 6, and integer v0, there exists v ≥ v0 and a set Hv_d of v-vertex graphs s.t.:

1. (Low likelihood of appearing) Every H ∈ Hv_d has average degree d. That is, the number of edges of H is e = dv/2.

2. (Strong strict balance) For every H ∈ Hv_d, every induced subgraph H′ of H with e′ ≤ e(1 − ε) edges and v′ vertices satisfies e′/v′ < e/v − η for a constant η depending only on ε and d.7

3. Every H ∈ Hv_d has no non-trivial automorphisms.

4. (Pairwise near independence) For every pair of distinct graphs H, H′ ∈ Hv_d, if J is a shared subgraph of H and H′ with e′′ edges and v′′ vertices, then e′′/v′′ < e/v − η′ where η′ is a constant depending only on d.

5. (Largeness) The size of the family is |H| = v^{cv} where c is a constant depending only on d.

5More formally, vb_j = X_{Hj}(Gb), where X_H(G) is the number of injective homomorphisms of H to G, divided by the number of automorphisms of H.
6The range of values of p our algorithm is proven to succeed for corresponds to the degrees achievable in Theorem 2.2. We conjecture that a family achieving these properties can be obtained with any density e/v > 1, which would extend our analysis to p = n^{δ−1} for all δ ∈ (0, 1) (see the supplementary materials).
7This condition is a strengthening of the "strict balance" condition in the random graph literature [31].

The proof of Theorem 2.2 is quite involved, and we leave it to the supplementary materials. Here, we sketch the construction of Hv_d where d = 2 + δ for a small constant δ > 0. This is the most interesting parameter regime, as it corresponds to the sparse graph case where the degree of G0, G1 is ∼ n^δ. We can express the number (1 − δ)/(1.5δ) as a convex combination kα + (k + 1)(1 − α) of the two integers k, k + 1. We choose a large enough integer v so that δv, 1.5δv and 1.5αδv are all integers. Now we choose a random three-regular graph H on δv vertices (and hence with 1.5δv edges), pick 1.5αδv of the edges of H uniformly at random and replace them with paths of length k (i.e., subdivide the edge with k vertices), and replace the remaining (1 − α)·1.5δv edges with paths of length k + 1. The resulting graph H′ will have average degree 2 + δ as desired. The bulk of the analysis is to prove that with high probability the graph H′ will satisfy the strong strict balance property, and moreover that we can repeat this process v^{Ω(v)} times and get a family of graphs, every pair of which satisfies the pairwise near independence property.
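To make the edge-subdivision step concrete, here is a small sketch. It is our own illustration: for simplicity we subdivide the complete graph K4 (a 3-regular graph) rather than a random 3-regular graph on δv vertices, and `lengths[i]` is the number of internal vertices placed on edge i:

```python
def subdivide(n, edges, lengths):
    """Replace edge i of a graph on vertices 0..n-1 by a path with
    lengths[i] internal vertices. Returns (num_vertices, new_edge_list)."""
    new_edges, next_v = [], n
    for (u, v), k in zip(edges, lengths):
        path = [u] + list(range(next_v, next_v + k)) + [v]
        next_v += k
        new_edges += list(zip(path, path[1:]))
    return next_v, new_edges

def average_degree(n, edges):
    return 2 * len(edges) / n

# K4 is 3-regular (average degree 3). Subdividing every edge pushes the
# average degree down toward 2 from above; mixing k and k+1 internal
# vertices per edge lets one hit 2 + delta exactly for suitable k, alpha.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
n2, e2 = subdivide(4, k4, [2] * 6)  # two internal vertices per edge
```

With two internal vertices per edge the subdivided K4 has 16 vertices and 18 edges, i.e., average degree 2.25; longer subdivisions drive the density closer to 2.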
See Figure 1 for an example.

Figure 1: An example of the construction where d = 2 + δ and k = 0.

3 Algorithm for Hypothesis Testing

In this section, we describe the algorithm for hypothesis testing based on the "black swan" approach introduced in Section 2. Let H be the family of graphs constructed in Theorem 2.2; we define the following correlation polynomial:

    P_H(G0, G1) = (1/|H|) Σ_{H ∈ H} (X_H(G0) − E_{G∼G(n,pγ)} X_H(G)) · (X_H(G1) − E_{G∼G(n,pγ)} X_H(G)).

Intuitively, the expectation of P_H(G0, G1) is zero under Dnull but large under Dstruct. Specifically, we prove the following theorem:

Theorem 3.1. For any n large enough, sufficiently small δ > 0, and any γ ∈ (0, 1), let H = Hv_d be obtained from Theorem 2.2 where d = 2/(1−δ) and |Hv_d| ≥ (400/γ²)^{dv}. Then E_Dnull[P_H(G0, G1)] = 0, and E_Dstruct[P_H(G0, G1)] ≥ 40 · max(Var_Dnull(P_H(G0, G1))^{1/2}, Var_Dstruct(P_H(G0, G1))^{1/2}), where all distributions above are for n vertices, p = n^{δ−1} and noise γ.

The proof of Theorem 3.1 is provided in the supplementary materials. The degree of the polynomial P_H(G0, G1) is 2e, where e is the number of edges in any member of the family, and so its number of monomials (and hence its computation time) will be n^{O(e)} = n^{O(1)}, where the constant in the O(1) depends on the size of the representation of (1 − δ)^{−1} as a ratio of two integers. Combining Theorem 3.1 with Chebyshev's inequality, the following algorithm solves the hypothesis testing problem in the parameter regime stated in Theorem 1.1.

Algorithm 1 HYPOTHESISTESTING
Input: Parameters n, p, γ where p = n^{δ−1}. Graphs G0, G1 sampled from either Dnull(n, p; γ) or Dstruct(n, p; γ).
Output: "(G0, G1) came from Dnull" or "(G0, G1) came from Dstruct".
1: d ← 2/(1−δ).
2: Choose v to be a sufficiently large even number such that v^c > 400/γ², where c is the constant from Theorem 2.2, so that |Hv_d| ≥ v^{cdv/2}.
3: H ← Hv_d, where Hv_d is obtained from Theorem 2.2.
4: Compute μstruct ← E_{(G′0,G′1)∼Dstruct(n,p;γ)}[P_H(G′0, G′1)].
5: if P_H(G0, G1) > (1/3)·μstruct then Output "(G0, G1) came from Dstruct".
6: else Output "(G0, G1) came from Dnull". end if

4 Algorithm for Recovery

In this section we present our algorithm for the recovery (i.e., graph matching) task. All proofs are provided in the supplementary material. Our algorithm follows the following general template:

Algorithm 2 RECOVERY
Input: Parameters n, p, γ and graphs G0, G1 sampled from Dstruct(n, p; γ).
Output: A permutation π ∈ Sn.
1: H ← INITIALIZERECOVERY(n, p, γ). ▷ Initialize a graph family H = Hv_{d′} by Theorem 2.2.
2: π0 ← PARTIALASSIGNMENT(n, p, γ, G0, G1, H). ▷ Find an initial partial assignment π0.
3: π ← BOOSTING(n, p, γ, G0, G1, π0). ▷ Boost the partial assignment π0 to a final assignment π.
4: return π.

There are three steps in the above general template RECOVERY, each of which is of independent interest. In the first step, one constructs a family of subgraphs with nice structure so that, in the second step, these subgraphs can be used to efficiently come up with a partial assignment π0 for the recovery problem. A partial assignment correctly matches a good fraction of the vertices between G0 and G1; however, one does not know which vertices are correctly matched.
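To illustrate the shape of the subgraph-count statistics used above, here is a sketch in which a single triangle count stands in for one member of the swan family. This is an illustrative simplification of ours, not the paper's actual family Hv_d, which comes from Theorem 2.2:

```python
from itertools import combinations

def triangle_count(n, edges):
    """X_H(G) for H a triangle (a stand-in for one swan-family member)."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(1 for a, b, c in combinations(range(n), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def centered_product(n, g0, g1, p_eff):
    """One term of the correlation polynomial P_H: centered subgraph counts
    multiplied across the two graphs. p_eff = p*gamma is the marginal edge
    density, so E[triangle count] under G(n, p_eff) is C(n,3) * p_eff**3."""
    mu = (n * (n - 1) * (n - 2) / 6) * p_eff ** 3
    return (triangle_count(n, g0) - mu) * (triangle_count(n, g1) - mu)
```

The full statistic averages such centered products over the whole family and thresholds against (1/3)·μstruct, as in Algorithm 1.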
Thus, in the last step, the boosting algorithm transforms an arbitrary partial assignment π0 into a final assignment π that correctly matches every vertex. The main contribution of this paper lies in the first two steps, which use the black swan approach, while the last step is a variant of previous seed-set based algorithms. In the following, we instantiate RECOVERY using the test graph family constructed in Theorem 2.2 and prove Theorem 1.2.

Step 1: Construct graph family. Here we describe the algorithm INITIALIZERECOVERY as follows. For p = n^{δ−1}: if 0 < δ < 1/153, choose v = Θ(log n) to be the smallest even integer so that λv is also an integer, for some λ ∈ (2δ/(1−δ), 2δ/(1−δ) + (log log n)/(log n)), and set d′ = 2 + λ. If 2/3 ≤ δ < 1, choose v = Θ(log n) to be the smallest even integer so that there is some d′ ∈ (2δ/(1−δ), 2δ/(1−δ) + (log log n)/(4 log n)) such that (d′ − ⌊d′⌋)v is also an integer. Finally, pick H to be Hv_{d′}, where Hv_{d′} is obtained from Theorem 2.2.

Step 2: Partial assignment. The second part of the recovery algorithm is a procedure for finding a noisy seed set. Specifically, if (G0, G1, π*) are sampled from Dstruct(n, p; γ), and 0 < θ, η ≤ 1 are some constants, then a (θ, η) partial assignment is a partial function π : V(G0) → V(G1) that is one-to-one, defined on at least a θ fraction of the inputs, such that for at least an η fraction of the inputs u on which π is defined, π(u) = π*(u). We prove that the algorithm PARTIALASSIGNMENT below gives an (n/O(log log n), 1 − o(1))-partial assignment with probability 1 − o(1) over the Erdős-Rényi model and the randomness of the algorithm.

Lemma 4.1.
Suppose that (G0, G1) ∼ Dstruct(n, p; γ) and H = Hv_d is obtained from INITIALIZERECOVERY. Then under the conditions of Theorem 1.2, PARTIALASSIGNMENT outputs an (n/log v, 1 − 1/v^{1/8})-partial assignment with probability 1 − o(1) over the choice of (G0, G1) ∼ Dstruct(n, p; γ) and the randomness of the algorithm.

Algorithm 3 PARTIALASSIGNMENT
Input: Parameters n, p, γ, graphs G0, G1 sampled from Dstruct(n, p; γ), and a family of graphs H.
Output: A partial assignment π0.
1: v ← |V(H)|, e ← |E(H)|, ∀H ∈ H.
2: d′ ← 2e/v.
3: π0(u) ← ∅ for all u ∈ V(G0).
4: for u ∈ V(G0) do
5:   Hu ← {H ∈ H : u is incident to a copy of H in G0 and H appears in G1}.
6:   if |Hu| ≥ (1/2)·|H|·v·n^{v−1}(pγ²)^e then
7:     Pick H ← Hu at random.
8:     w ← the corresponding vertex of u in the copy of H in G1.
9:     if ¬∃u′ ≠ u such that π0(u′) = w then π0(u) ← w. end if
10:   end if
11: end for
12: return π0.

Step 3: Boosting. Finally, in the last step of the recovery algorithm, we boost the partial assignment to a full permutation from V(G0) to V(G1). This step is based on the "Percolation Graph Matching" approach used in works such as [12, 13, 32, 33, 14]. However, we need a stronger analysis of this step, since the partial knowledge obtained from PARTIALASSIGNMENT can be noisy and (more importantly) might have arbitrary correlation with the random graph, and hence we need to assume that it might be adversarially chosen. Specifically, we show that we can boost an (n/O(log log n), 1 − o(1)) partial assignment to the full ground truth:

Lemma 4.2 (Boosting from partial knowledge). Let p, γ, n, η, c, θ be such that pγn ≥ log^c n for c > 1, ηθ = o(γ²) and θ = Ω(log^{1−c} n). Then with probability 1 − o(1) over the choice of (G0, G1, π*) from Dstruct(n, p; γ), if BOOSTING is given G0, G1 and any (θn, 1 − η) partial assignment π, then it outputs the ground truth permutation π*.

Algorithm 4 BOOSTING
Input: Parameters n, p, γ, graphs G0, G1 sampled from Dstruct(n, p; γ), a partial assignment π0.
Output: A permutation π ∈ Sn.
1: (θ, η) ← Lemma 4.1 and π ← π0. ▷ π0 is a (θ, η)-partial assignment.
2: Δ ← ⌊θγ²np/100⌋.
3: for u ∈ V(G0), w ∈ V(G1) do N(u, w) ← |{u′ ∈ V(G0) : u ∼ u′, π(u′) ∼ w}|. end for
4: while ∃u ∈ V(G0) where π(u) = ∅ and ∃w ∈ V(G1), N(u, w) ≥ Δ do π(u) ← w. end while
5: if π is not a permutation then Complete π arbitrarily. end if
6: Δ′ ← ⌊γ²np/100⌋.
7: while ∃u ∈ V(G0), w ∈ V(G1) such that N(u, w) ≥ Δ′ and N(u, π(u)), N(π⁻¹(w), w) < Δ′/10 do Modify π by mapping u to w and mapping π⁻¹(w) to π(u). end while
8: return π.

References

[1] Timothee Cour, Praveen Srinivasan, and Jianbo Shi. Balanced graph matching. In Advances in Neural Information Processing Systems, pages 313–320, 2007.

[2] Minsu Cho and Kyoung Mu Lee. Progressive graph matching: Making a move of graphs via probabilistic voting. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 398–405. IEEE, 2012.

[3] Alexander C Berg, Tamara L Berg, and Jitendra Malik. Shape matching and object recognition using low distortion correspondences. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 26–33.
IEEE, 2005.

[4] Rohit Singh, Jinbo Xu, and Bonnie Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences, 105(35):12763–12768, 2008.

[5] Joshua T Vogelstein, John M Conroy, Louis J Podrazik, Steven G Kratzer, Eric T Harley, Donniell E Fishkind, R Jacob Vogelstein, and Carey E Priebe. Large (brain) graph matching via fast approximate quadratic programming. arXiv preprint arXiv:1112.5507, 2011.

[6] Nitish Korula and Silvio Lattanzi. An efficient reconciliation algorithm for social networks. Proceedings of the VLDB Endowment, 7(5):377–388, 2014.

[7] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In Security and Privacy, 2009 30th IEEE Symposium on, pages 173–187. IEEE, 2009.

[8] Younghee Park, Douglas Reeves, Vikram Mulukutla, and Balaji Sundaravel. Fast malware classification by automated behavioral graph matching. In Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research, CSIIRW '10, pages 45:1–45:4, New York, NY, USA, 2010. ACM.

[9] Lorenzo Livi and Antonello Rizzi. The graph matching problem. Pattern Analysis and Applications, 16(3):253–283, 2013.

[10] Donatello Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(03):265–298, 2004.

[11] Pedram Pedarsani and Matthias Grossglauser. On the privacy of anonymized networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235–1243. ACM, 2011.

[12] Lyudmila Yartseva and Matthias Grossglauser. On the performance of percolation graph matching. In Proceedings of the First ACM Conference on Online Social Networks, pages 119–130.
ACM, 2013.\n\n[13] Vince Lyzinski, Donniell E Fishkind, and Carey E Priebe. Seeded graph matching for correlated Erdös-Rényi graphs. Journal of Machine Learning Research, 15(1):3513–3540, 2014.\n\n[14] Ehsan Kazemi, S Hamed Hassani, and Matthias Grossglauser. Growing a graph matching from a handful of seeds. Proceedings of the VLDB Endowment, 8(10):1010–1021, 2015.\n\n[15] Daniel Cullina and Negar Kiyavash. Improved achievability and converse bounds for Erdös-Rényi graph matching. In ACM SIGMETRICS Performance Evaluation Review, volume 44, pages 63–72. ACM, 2016.\n\n[16] Daniel Cullina and Negar Kiyavash. Exact alignment recovery for correlated Erdös-Rényi graphs. arXiv preprint arXiv:1711.06783, 2017.\n\n[17] Hugo Gascon, Fabian Yamaguchi, Daniel Arp, and Konrad Rieck. Structural detection of Android malware using embedded call graphs. In Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, pages 45–54. ACM, 2013.\n\n[18] Neha Runwal, Richard M Low, and Mark Stamp. Opcode graph similarity and metamorphic detection. Journal in Computer Virology, 8(1-2):37–52, 2012.\n\n[19] John W Raymond, Eleanor J Gardiner, and Peter Willett. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. Journal of Chemical Information and Computer Sciences, 42(2):305–316, 2002.\n\n[20] Masahiro Hattori, Yasushi Okuno, Susumu Goto, and Minoru Kanehisa. Heuristics for chemical compound matching. Genome Informatics, 14:144–153, 2003.\n\n[21] Maureen Heymans and Ambuj K Singh. Deriving phylogenetic trees from the similarity analysis of metabolic pathways. Bioinformatics, 19(suppl_1):i138–i146, 2003.\n\n[22] V. Lyzinski, D. E. Fishkind, M. Fiori, J. T. Vogelstein, C. E. Priebe, and G. Sapiro. 
Graph matching: Relax at your own risk. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):60–73, Jan 2016.\n\n[23] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.\n\n[24] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings 18th International Conference on Data Engineering, pages 117–128. IEEE, 2002.\n\n[25] Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.\n\n[26] Laura A Zager and George C Verghese. Graph similarity scoring and matching. Applied Mathematics Letters, 21(1):86–94, 2008.\n\n[27] Yunsheng Bai, Hao Ding, Yizhou Sun, and Wei Wang. Convolutional set matching for graph similarity. arXiv preprint arXiv:1810.10866, 2018.\n\n[28] Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, and Wei Wang. SimGNN: A neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pages 384–392, New York, NY, USA, 2019. ACM.\n\n[29] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.\n\n[30] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.\n\n[31] Béla Bollobás. Random graphs, pages 80–102. Cambridge University Press, 1981.\n\n[32] Vince Lyzinski, Sancar Adali, Joshua T Vogelstein, Youngser Park, and Carey E Priebe. Seeded graph matching via joint optimization of fidelity and commensurability. arXiv preprint arXiv:1401.3813, 2014.\n\n[33] Svante Janson, Tomasz Luczak, Tatyana Turova, Thomas Vallier, et al. 
Bootstrap percolation on the random graph G_{n,p}. The Annals of Applied Probability, 22(5):1989–2047, 2012.", "award": [], "sourceid": 4930, "authors": [{"given_name": "Boaz", "family_name": "Barak", "institution": "Harvard University"}, {"given_name": "Chi-Ning", "family_name": "Chou", "institution": "Harvard University"}, {"given_name": "Zhixian", "family_name": "Lei", "institution": "Harvard University"}, {"given_name": "Tselil", "family_name": "Schramm", "institution": "Harvard University"}, {"given_name": "Yueqi", "family_name": "Sheng", "institution": "Harvard University"}]}