{"title": "Structure learning of antiferromagnetic Ising models", "book": "Advances in Neural Information Processing Systems", "page_first": 2852, "page_last": 2860, "abstract": "In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. Our first result is an unconditional computational lower bound of $\\Omega (p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of statistical algorithms recently introduced by Feldman et al. The construction is related to the notoriously difficult learning parities with noise problem in computational learning theory. Our lower bound shows that the $\\widetilde O(p^{d+2})$ runtime required by Bresler, Mossel, and Sly's exhaustive-search algorithm cannot be significantly improved without restricting the class of models. Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., most recent papers on structure learning assume that the model has the correlation decay property. Indeed, focusing on ferromagnetic Ising models, Bento and Montanari showed that all known low-complexity algorithms fail to learn simple graphs when the interaction strength exceeds a number related to the correlation decay threshold. Our second set of results gives a class of repelling (antiferromagnetic) models that have the \\emph{opposite} behavior: very strong repelling allows efficient learning in time $\\widetilde O(p^2)$. 
We provide an algorithm whose performance interpolates between $\\widetilde O(p^2)$ and $\\widetilde O(p^{d+2})$ depending on the strength of the repulsion.", "full_text": "Structure learning of antiferromagnetic Ising models\n\nGuy Bresler$^1$, David Gamarnik$^2$, Devavrat Shah$^1$\n\nLaboratory for Information and Decision Systems\n\nDepartment of EECS$^1$ and Sloan School of Management$^2$\n\nMassachusetts Institute of Technology\n\n{gbresler,gamarnik,devavrat}@mit.edu\n\nAbstract\n\nIn this paper we investigate the computational complexity of learning the\ngraph structure underlying a discrete undirected graphical model from i.i.d.\nsamples. Our first result is an unconditional computational lower bound\nof $\\Omega(p^{d/2})$ for learning general graphical models on $p$ nodes of maximum\ndegree $d$, for the class of so-called statistical algorithms recently introduced\nby Feldman et al. [1]. The construction is related to the notoriously difficult\nlearning parities with noise problem in computational learning theory. Our\nlower bound suggests that the $\\widetilde O(p^{d+2})$ runtime required by Bresler, Mossel,\nand Sly\u2019s [2] exhaustive-search algorithm cannot be significantly improved\nwithout restricting the class of models.\nAside from structural assumptions on the graph such as it being a tree,\nhypertree, tree-like, etc., many recent papers on structure learning assume\nthat the model has the correlation decay property. Indeed, focusing on\nferromagnetic Ising models, Bento and Montanari [3] showed that all known\nlow-complexity algorithms fail to learn simple graphs when the interaction\nstrength exceeds a number related to the correlation decay threshold. Our\nsecond set of results gives a class of repelling (antiferromagnetic) models\nthat have the opposite behavior: very strong interaction allows efficient\nlearning in time $\\widetilde O(p^2)$. 
We provide an algorithm whose performance\ninterpolates between $\\widetilde O(p^2)$ and $\\widetilde O(p^{d+2})$ depending on the strength of the\nrepulsion.\n\n1 Introduction\n\nGraphical models have had tremendous impact in a variety of application domains. For\nunstructured high-dimensional distributions, such as in social networks, biology, and finance,\nan important first step is to determine which graphical model to use. In this paper we\nfocus on the problem of structure learning: Given access to $n$ independent and identically\ndistributed samples $\\sigma^{(1)}, \\dots, \\sigma^{(n)}$ from an undirected graphical model representing a discrete\nrandom vector $\\sigma = (\\sigma_1, \\dots, \\sigma_p)$, the goal is to find the graph $G$ underlying the model. Two\nbasic questions are 1) How many samples are required? and 2) What is the computational\ncomplexity?\nIn this paper we are mostly interested in the computational complexity of structure learning.\nWe first consider the problem of learning a general discrete undirected graphical model of\nbounded degree.\n\n1.1 Learning general graphical models\n\nSeveral algorithms based on exhaustively searching over possible node neighborhoods have\nappeared in the last decade [4, 2, 5]. Abbeel, Koller, and Ng [4] gave algorithms for learning\ngeneral graphical models close to the true distribution in Kullback-Leibler distance. 
Bresler,\nMossel, and Sly [2] presented algorithms guaranteed to learn the true underlying graph.\nThe algorithms in both [4] and [2] perform a search over candidate neighborhoods, and for\na graph of maximum degree $d$, the computational complexity for recovering a graph on $p$\nnodes scales as $\\widetilde O(p^{d+2})$ (where the $\\widetilde O$ notation hides logarithmic factors).\nWhile the algorithms in [2] are guaranteed to reconstruct general models under basic\nnondegeneracy conditions using an optimal number of samples $n = O(d \\log p)$ (sample\ncomplexity lower bounds were proved by Santhanam and Wainwright [6] as well as [2]), the\nexponent $d$ in the $\\widetilde O(p^{d+2})$ run-time is impractically high even for constant but large graph\ndegrees. This has motivated a great deal of work on structure learning for special classes of\ngraphical models. But before giving up on general models, we ask the following question:\nQuestion 1: Is it possible to learn the structure of general graphical models on $p$\nnodes with maximum degree $d$ using substantially less computation than $p^d$?\n\nOur first result suggests that the answer to Question 1 is negative. We show an unconditional\ncomputational lower bound of $p^{d/2}$ for the class of statistical algorithms introduced\nby Feldman et al. [1]. This class of algorithms was introduced in order to understand the\napparent difficulty of the Planted Clique problem, and is based on Kearns\u2019 statistical query\nmodel [7]. Kearns showed in his landmark paper that statistical query algorithms require\nexponential computation to learn parity functions subject to classification noise, and our\nhardness construction is related to this problem. Most known algorithmic approaches\n(including Markov chain Monte Carlo, semidefinite programming, and many others) can be\nimplemented as statistical algorithms, so the lower bound is fairly convincing.\nWe give background and prove the following theorem in Section 4.\nTheorem 1.1. 
Statistical algorithms require at least $\\Omega(p^{d/2})$ computation steps in order to\nlearn the structure of a general graphical model of degree $d$.\n\nIf complexity $p^d$ is to be considered intractable, what shall we consider as tractable? Writing\nalgorithm complexity in the form $c(d)p^{f(d)}$, for high-dimensional (large $p$) problems the\nexponent $f(d)$ is of primary importance, and we will think of tractable algorithms as having\nan $f(d)$ that is bounded by a constant independent of $d$. The factor $c(d)$ is also important,\nand we will use it to compare algorithms with the same exponent $f(d)$.\nIn light of Theorem 1.1, reducing computation below $p^{\\Omega(d)}$ requires restricting the class\nof models. One can either restrict the graph structure or the nature of the interactions\nbetween variables. The seminal paper of Chow and Liu [8] makes a model restriction of\nthe first type, assuming that the graph is a tree; generalizations include polytrees [9],\nhypertrees [10], and others. Among the many possible assumptions of the second type,\nthe correlation decay property is distinguished: to the best of our knowledge all existing\nlow-complexity algorithms require the correlation decay property [3].\n\n1.2 Correlation decay property\n\nInformally, a graphical model is said to have the correlation decay property (CDP) if any\ntwo variables $\\sigma_s$ and $\\sigma_t$ are asymptotically independent as the graph distance between $s$ and\n$t$ increases. Exponential decay of correlations holds when the distance from independence\ndecreases exponentially fast in graph distance, and we will mean this stronger form when\nreferring to correlation decay. 
Correlation decay is known to hold for a number of pairwise\ngraphical models in the so-called high-temperature regime, including Ising, hard-core lattice\ngas, Potts (multinomial) model, and others (see, e.g., [11, 12, 13, 14, 15, 16]).\n\nBresler, Mossel, and Sly [2] observed that it is possible to efficiently learn models with\n(exponential) decay of correlations, under the additional assumption that neighboring variables\nhave correlation bounded away from zero (as is true, e.g., for the ferromagnetic Ising model\nin the high temperature regime). The algorithm they proposed for this setting pruned the\ncandidate set of neighbors for each node to roughly size $O(d)$ by retaining only those variables\nwith sufficiently high correlations, and then within this set performed the exhaustive search\nover neighborhoods mentioned before, resulting in a computational cost of $d^{O(d)}\\widetilde O(p^2)$. The\ngreedy algorithms of Netrapalli et al. [17] and Ray et al. [18] also require the correlation decay\nproperty and perform a similar pruning step by retaining only nodes with high pairwise\ncorrelation; they then use a different method to select the true neighborhood.\nA number of papers consider the problem of reconstructing Ising models on graphs with\nfew short cycles, beginning with Anandkumar et al. [19]. Their results apply to the case of\nIsing models on sparsely connected graphs such as the Erd\u0151s-R\u00e9nyi random graph $G(p, d/p)$.\nThey additionally require the interaction parameters to be either generic or ferromagnetic.\nFerromagnetic models have the benefit that neighbors always have a non-negligible correlation\nbecause the dependencies cannot cancel, but in either case the results still require the\nCDP to hold. Wu et al. 
[20] remove the assumption of generic parameters in [19], but again\nrequire the CDP.\nOther algorithms for structure learning are based on convex optimization, such as Ravikumar\net al.\u2019s [21] approach using regularized node-wise logistic regression. While this\nalgorithm does not explicitly require the CDP, Bento and Montanari [3] found that the\nlogistic regression algorithm of [21] provably fails to learn certain ferromagnetic Ising models\non simple graphs without correlation decay. Other convex optimization-based algorithms\nsuch as [22, 23, 24] require similar incoherence or restricted isometry-type conditions that\nare difficult to verify, but likely also require correlation decay. Since all known algorithms\nfor structure learning require the CDP, we ask the following question (paraphrasing Bento\nand Montanari):\nQuestion 2: Is low-complexity structure learning possible for models which do not\nexhibit the CDP, on general bounded degree graphs?\n\nOur second main result answers this question affirmatively by showing that a broad class of\nrepelling models on general graphs can be learned using simple algorithms, even when the\nunderlying model does not exhibit the CDP.\n\n1.3 Repelling models\nThe antiferromagnetic Ising model has a negative interaction parameter, whereby neighboring\nnodes prefer to be in opposite states. Other popular antiferromagnetic models include\nthe Potts or coloring model, and the hard-core model.\nAntiferromagnetic models have the interesting property that correlations between neighbors\ncan be zero due to cancellations. Thus algorithms based on pruning neighborhoods using\npairwise correlations, such as the algorithm in [2] for models with correlation decay, do not\nwork for antiferromagnetic models. 
To our knowledge there are no previous results that\nimprove on the $p^d$ computational complexity for structure learning of antiferromagnetic\nmodels on general graphs of maximum degree $d$.\nOur first learning algorithm, described in Section 2, is for the hard-core model.\nTheorem 1.2 (Informal). It is possible to learn strongly repelling models, such as the\nhard-core model, with run-time $\\widetilde O(p^2)$.\nWe extend this result to weakly repelling models (equivalent to the antiferromagnetic Ising\nmodel parameterized in a nonstandard way, see Section 3). Here $\\beta$ is a repelling strength\nand $h$ is an external field.\nTheorem 1.3 (Informal). Suppose $\\beta \\ge (d - \\alpha)(h + \\ln 2)$ for an integer $0 \\le \\alpha < d$. Then\nit is possible to learn a repelling model with interaction $\\beta$, with run-time $\\widetilde O(p^{2+\\alpha})$.\n\nThe computational complexity of the algorithm interpolates between $\\widetilde O(p^2)$, achievable for\nstrongly repelling models, and $\\widetilde O(p^{d+2})$, achievable for general models using exhaustive\nsearch. The complexity depends on the repelling strength of the model, rather than structural\nassumptions on the graph as in [19, 20].\nWe remark that the strongly repelling models exhibit long-range correlations, yet the\nalgorithmic task of graph structure learning is possible using a local procedure.\nThe focus of this paper is on structure learning, but the problem of parameter estimation\nis equally important. It turns out that the structure learning problem is strictly more\nchallenging for the models we consider: once the graph is known, it is not difficult to\nestimate the parameters with low computational complexity (see, e.g., [4]).\n\n2 Learning the graph of a hard-core model\nWe warm up by considering the hard-core model. 
The analysis in this section is straightforward,\nbut serves as an example to highlight the fact that correlation decay is not a necessary\ncondition for structure learning.\nGiven a graph $G = (V, E)$ on $|V| = p$ nodes, denote by $\\mathcal{I}(G) \\subseteq \\{0, 1\\}^p$ the set of independent\nset indicator vectors $\\sigma$, for which at least one of $\\sigma_i$ or $\\sigma_j$ is zero for each edge $\\{i, j\\} \\in E(G)$.\nThe hard-core model with fugacity $\\lambda > 0$ assigns nonzero probability only to vectors in $\\mathcal{I}(G)$,\nwith\n$P(\\sigma) = \\frac{\\lambda^{|\\sigma|}}{Z}, \\quad \\sigma \\in \\mathcal{I}(G).$ (2.1)\nHere $|\\sigma|$ is the number of entries of $\\sigma$ equal to one and $Z = \\sum_{\\sigma \\in \\mathcal{I}(G)} \\lambda^{|\\sigma|}$ is the normalizing\nconstant called the partition function. If $\\lambda > 1$ then more mass is assigned to larger\nindependent sets. (We use indicator vectors to define the model in order to be consistent\nwith the antiferromagnetic Ising model in the next section.)\nOur goal is to learn the graph $G = (V, E)$ underlying the model (2.1) given access to independent\nsamples $\\sigma^{(1)}, \\dots, \\sigma^{(n)}$. The following simple algorithm reconstructs $G$ efficiently.\n\nAlgorithm 1 simpleHC($\\sigma^{(1)}, \\dots, \\sigma^{(n)}$)\n1: FOR each $i, j, k$:\n2: IF $\\sigma_i^{(k)} = \\sigma_j^{(k)} = 1$, THEN $S = S \\cup \\{i, j\\}$\n3: OUTPUT $\\hat E = S^c$\n\nThe idea behind the algorithm is very simple. If $\\{i, j\\}$ belongs to the edge set $E(G)$, then\nfor every sample $\\sigma^{(k)}$ either $\\sigma_i^{(k)} = 0$ or $\\sigma_j^{(k)} = 0$ (or both). Thus for every $i, j$ and $k$ such\nthat $\\sigma_i^{(k)} = \\sigma_j^{(k)} = 1$ we can safely declare $\\{i, j\\}$ not to be an edge. To show correctness of\nthe algorithm it is therefore sufficient to argue that for every non-edge $\\{i, j\\}$ there is a high\nlikelihood that such an independent set $\\sigma^{(k)}$ will be sampled.\nBefore doing this, we observe that simpleHC actually computes the maximum-likelihood\nestimate for the graph $G$. 
To see this, note that an edge $e = \\{i, j\\}$ for which $\\sigma_i^{(k)} = \\sigma_j^{(k)} = 1$\nfor some $k$ cannot be in $\\hat G$, since $P(\\sigma^{(k)} | \\hat G + e) = 0$ for any $\\hat G$. Thus the ML estimate contains\na subset of those edges $e$ which have not been ruled out by $\\sigma^{(1)}, \\dots, \\sigma^{(n)}$. But adding any\nsuch edge $e$ to the graph decreases the value of the partition function in (2.1) (the sum is\nover fewer independent sets), thereby increasing the likelihood of each of the samples.\nThe sample complexity and computational complexity of simpleHC are as follows, with proof\nin the Supplement.\nTheorem 2.1. Consider the hard-core model (2.1) on a graph $G = (V, E)$ on $|V| = p$ nodes\nand with maximum degree $d$. The sample complexity of simpleHC is\n$n = O((2\\lambda)^{2d-2} \\log p),$ (2.2)\ni.e. with this many samples the algorithm simpleHC correctly reconstructs the graph with\nprobability $1 - o(1)$. The computational complexity is\n$O(np^2) = O((2\\lambda)^{2d-2} p^2 \\log p).$ (2.3)\n\nWe next show that the sample complexity bound in Theorem 2.1 is basically tight:\nTheorem 2.2 (Sample complexity lower bound). Consider the hard-core model (2.1). There\nis a family of graphs on $p$ nodes with maximum degree $d$ such that for the probability of\nsuccessful reconstruction to approach one, the number of samples must scale as\n\nn = 1(2\u2044)2d log p\nd2 .\n\nLemma 2.3. Suppose edge $e = (i, j) \\notin G$, and let $I$ be an independent set chosen according\nto the Gibbs distribution (2.1). Then $P(\\{i, j\\} \\subseteq I) \\ge (9 \\cdot \\max\\{1, (2\\lambda)^{2d-2}\\})^{-1} =: \\gamma$.\nThe Supplementary Material contains proofs for Theorem 2.2 and Lemma 2.3.\n\n3 Learning antiferromagnetic Ising models\nIn this section we consider the antiferromagnetic Ising model on a graph $G = (V, E)$. 
We\nparametrize the model in such a way that each configuration has probability\n$P(\\sigma) = \\frac{1}{Z} \\exp\\{H(\\sigma)\\}, \\quad \\sigma \\in \\{0, 1\\}^p,$ (3.1)\nwhere\n$H(\\sigma) = -\\beta \\sum_{(i,j) \\in E} \\sigma_i \\sigma_j + \\sum_{i \\in V} h_i \\sigma_i.$ (3.2)\n\nHere $\\beta > 0$ and $\\{h_i\\}_{i \\in V}$ are real-valued parameters, and we assume that $|h_i| \\le h$ for all $i$.\nWorking with configurations in $\\{0, 1\\}^p$ rather than the more typical $\\{-1, +1\\}^p$ amounts to\na reparametrization (which is without loss of generality as shown for example in Appendix 1\nof [25]). Setting $h_i = h = \\ln \\lambda$ for all $i$, we recover the hard-core model with fugacity $\\lambda$ in\nthe limit $\\beta \\to \\infty$, so we think of (3.2) as a \u201csoft\u201d independent set model.\n3.1 Strongly antiferromagnetic models\nWe start by considering the situation in which the repelling strength $\\beta$ is sufficiently large\nthat we can modify the approach used for the hard-core model. We require some notation\nto work with conditional probabilities: for each vertex $b \\in V$, let\n$B_b = \\{\\sigma^{(i)} : \\sigma_b^{(i)} = 1\\}$\nand\n$\\hat P(\\sigma_a = 1 | \\sigma_b = 1) := \\frac{1}{|B_b|} |\\{\\sigma^{(i)} \\in B_b : \\sigma_a^{(i)} = 1\\}|.$\n\nOf course, $E[\\hat P(\\sigma_a = 1 | \\sigma_b = 1)] = P(\\sigma_a = 1 | \\sigma_b = 1)$. The algorithm, described next,\ndetermines whether each edge $\\{a, b\\}$ is present based on comparing $\\hat P$ to a threshold.\nAlgorithm 2 StrongRepelling\nInput: $\\beta$, $h$, $d$, and $n$ samples $\\sigma^{(1)}, \\dots, \\sigma^{(n)} \\in \\{0, 1\\}^p$. Output: edge set $\\hat E$.\n1: Let $\\delta = (1 + 2^d e^{h(d-1)})^{-2}$\n2: FOR each possible edge $\\{a, b\\} \\in \\binom{V}{2}$:\n3: IF $\\hat P(\\sigma_a = 1 | \\sigma_b = 1) \\le (1 + e^{\\beta - h})^{-1} + \\delta$ THEN add edge $(a, b)$ to $\\hat E$\n4: OUTPUT $\\hat E$\n\nAlgorithm StrongRepelling obtains the following performance. The proof of Proposition 3.1\nis similar to that of Theorem 2.1, replacing Lemma 2.3 by Lemma 3.2 below.\n\nProposition 3.1. 
Consider the antiferromagnetic Ising model (3.2) on a graph $G = (V, E)$\non $p$ nodes and with maximum degree $d$. If\n$\\beta \\ge d(h + \\ln 2),$\nthen algorithm StrongRepelling has sample complexity\n$n = O(2^{2d} e^{2h(d+1)} \\log p),$\ni.e. this many samples are sufficient to reconstruct the graph with probability $1 - o(1)$. The\ncomputational complexity of StrongRepelling is\n$O(np^2) = O(2^{2d} e^{2h(d+1)} p^2 \\log p).$\n\nWhen the interaction parameter $\\beta \\ge d(h + \\ln 2)$ it is possible to identify edges using pairwise\nstatistics. The next lemma, proved in the Supplement, shows the desired separation.\nLemma 3.2. We have the following estimates:\n(i) If $(a, b) \\notin E(G)$, then $P(\\sigma_a = 1 | \\sigma_b = 1) \\ge \\frac{1}{1 + 2^{\\deg(a)} e^{h(\\deg(a)+1)}}$.\n(ii) Conversely, if $(a, b) \\in E(G)$, then $P(\\sigma_a = 1 | \\sigma_b = 1) \\le \\frac{1}{1 + e^{\\beta - h}}$.\n(iii) For any $b \\in V$, $P(\\sigma_b = 1) \\ge \\frac{1}{1 + 2^{\\deg(b)} e^{h(\\deg(b)+1)}}$.\n\n3.2 Weakly antiferromagnetic models\nIn this section we focus on learning weakly repelling models and show a trade-off between\ncomputational complexity and strength of the repulsion. Recall that for strongly repelling\nmodels our algorithm has run-time $O(p^2 \\log p)$, the same as for the hard-core model (infinite\nrepulsion).\nFor a subset of nodes $U \\subseteq V$, let $G \\setminus U$ denote the graph obtained from $G$ by removing nodes\nin $U$ (as well as any edges incident to nodes in $U$). The following corollary is immediate\nfrom Lemma 3.2.\nCorollary 3.3. 
We have the conditional probability estimates for deleting subsets of nodes:\n\n(i) If $(a, b) \\notin E(G)$, then for any subset of nodes $U \\subseteq V \\setminus \\{a, b\\}$,\n$P_{G \\setminus U}(\\sigma_a = 1 | \\sigma_b = 1) \\ge \\frac{1}{1 + 2^{\\deg_{G \\setminus U}(a)} e^{h(\\deg_{G \\setminus U}(a)+1)}}.$\n\n(ii) Conversely, if $(a, b) \\in E(G)$, then for any subset of nodes $U \\subseteq V \\setminus \\{a, b\\}$,\n$P_{G \\setminus U}(\\sigma_a = 1 | \\sigma_b = 1) \\le \\frac{1}{1 + e^{\\beta - h}}.$\n\nWe can effectively remove nodes from the graph by conditioning: The family of models (3.2)\nhas the property that conditioning on $\\sigma_i = 0$ amounts to removing node $i$ from the graph.\nFact 3.4 (Self-reducibility). Let $G = (V, E)$, and consider the model (3.2). Then for any\nsubset of nodes $U \\subseteq V$, the probability law $P_G(\\sigma \\in \\cdot \\,|\\, \\sigma_U = 0)$ is equal to $P_{G \\setminus U}(\\sigma_{V \\setminus U} \\in \\cdot)$.\nThe final ingredient is to show that we can condition by restricting attention to a subset of\nthe observed data, $\\sigma^{(1)}, \\dots, \\sigma^{(n)}$, without throwing away too many samples.\nLemma 3.5. Let $U \\subseteq V$ be a subset of nodes and denote the subset of samples with variables\n$\\sigma_U$ equal to zero by $A_U = \\{\\sigma^{(i)} : \\sigma_u^{(i)} = 0 \\text{ for all } u \\in U\\}$. Then with probability at least\n$1 - \\exp(-n / (2(1 + e^h)^{2|U|}))$ the number $|A_U|$ of such samples is at least $\\frac{n}{2} \\cdot (1 + e^h)^{-|U|}$.\nWe now present the algorithm. Effectively, it reduces node degree by removing nodes (which\ncan be done by conditioning on value zero), and then applies the strong repelling algorithm\nto the residual graph.\n\nAlgorithm 3 WeakRepelling\nInput: $\\beta$, $h$, $d$, and $n$ samples $\\sigma^{(1)}, \\dots, \\sigma^{(n)} \\in \\{0, 1\\}^p$. 
Output: edge set $\\hat E$.\n1: Let $\\delta = (1 + 2^d e^{h(d-1)})^{-2}$\n2: FOR each possible edge $(a, b) \\in \\binom{V}{2}$:\n3: FOR each $U \\subseteq V \\setminus \\{a, b\\}$ of size $|U| \\le \\lceil d - \\beta/(h + \\ln 2) \\rceil$\n4: Compute $\\hat P_{G \\setminus U}(\\sigma_a = 1 | \\sigma_b = 1)$\n5: IF $\\min_{U:|U|=} \\hat P_{G \\setminus U}(\\sigma_a = 1 | \\sigma_b = 1) \\le (1 + e^{\\beta - h})^{-1} + \\delta$ THEN add edge $(a, b)$ to $\\hat E$\n6: OUTPUT $\\hat E$\n\nTheorem 3.6. Let $\\alpha$ be a nonnegative integer strictly smaller than $d$, and consider the\nantiferromagnetic Ising model (3.2) with\n$\\beta \\ge (d - \\alpha)(h + \\ln 2)$\non a graph $G$. Algorithm WeakRepelling reconstructs the graph with probability $1 - o(1)$\nas $p \\to \\infty$ using\n$n = O((1 + e^h)^{\\alpha} 2^{2d} e^{2h(d+1)} \\log p)$\ni.i.d. samples, with run-time\n$O(np^{2+\\alpha}) = \\widetilde O_{h,d}(p^{2+\\alpha}).$\n\n4 Statistical algorithms and proof of Theorem 1.1\nWe start by describing the statistical algorithm framework introduced by [1]. In this section\nit is convenient to work with variables taking values in $\\{-1, +1\\}$ rather than $\\{0, 1\\}$.\n4.1 Background on statistical algorithms\nLet $\\mathcal{X} = \\{-1, +1\\}^p$ denote the space of configurations and let $\\mathcal{D}$ be a set of distributions\nover $\\mathcal{X}$. Let $\\mathcal{F}$ be a set of solutions (in our case, graphs) and $\\mathcal{Z} : \\mathcal{D} \\to 2^{\\mathcal{F}}$ be a map taking\neach distribution $D \\in \\mathcal{D}$ to a subset of solutions $\\mathcal{Z}(D) \\subseteq \\mathcal{F}$ that are defined to be valid\nsolutions for $D$. In our setting, since each graphical model is identifiable, there is a single\ngraph $\\mathcal{Z}(D)$ corresponding to each distribution $D$. For $n > 0$, the distributional search\nproblem $\\mathcal{Z}$ over $\\mathcal{D}$ and $\\mathcal{F}$ using $n$ samples is to find a valid solution $f \\in \\mathcal{Z}(D)$ given access\nto $n$ random samples from an unknown $D \\in \\mathcal{D}$.\nThe class of algorithms we are interested in are called unbiased statistical algorithms, defined\nby access to an unbiased oracle. 
Other related classes of algorithms are defined in [1], and\nsimilar lower bounds can be derived for those as well.\nDefinition 4.1 (Unbiased Oracle). Let $D$ be the true distribution. The algorithm is given\naccess to an oracle, which when given any function $h : \\mathcal{X} \\to \\{0, 1\\}$, takes an independent\nrandom sample $x$ from $D$ and returns $h(x)$.\nThese algorithms access the sampled data only through the oracle: unbiased statistical\nalgorithms outsource the computation. Because the data is accessed through the oracle, it\nis possible to prove unconditional lower bounds using information-theoretic methods. As\nnoted in the introduction, many algorithmic approaches can be implemented as statistical\nalgorithms.\nWe now define a key quantity called average correlation. The average correlation of a subset\nof distributions $\\mathcal{D}' \\subseteq \\mathcal{D}$ relative to a distribution $D$ is denoted $\\rho(\\mathcal{D}', D)$,\n$\\rho(\\mathcal{D}', D) := \\frac{1}{|\\mathcal{D}'|^2} \\sum_{D_1, D_2 \\in \\mathcal{D}'} \\left| \\left\\langle \\frac{D_1}{D} - 1, \\frac{D_2}{D} - 1 \\right\\rangle_D \\right|,$ (4.1)\nwhere $\\langle f, g \\rangle_D := E_{x \\sim D}[f(x)g(x)]$ and the ratio $D_1/D$ represents the ratio of probability\nmass functions, so $(D_1/D)(x) = D_1(x)/D(x)$.\nWe quote the definition of statistical dimension with average correlation from [1], and then\nstate a lower bound on the number of queries needed by any statistical algorithm.\nDefinition 4.2 (Statistical dimension). Fix $\\gamma > 0$, $\\eta > 0$, and search problem $\\mathcal{Z}$ over set\nof solutions $\\mathcal{F}$ and class of distributions $\\mathcal{D}$ over $\\mathcal{X}$. We consider pairs $(D, \\mathcal{D}_D)$ consisting\nof a \u201creference distribution\u201d $D$ over $\\mathcal{X}$ and a finite set of distributions $\\mathcal{D}_D \\subseteq \\mathcal{D}$ with the\nfollowing property: for any solution $f \\in \\mathcal{F}$, the set $\\mathcal{D}_f = \\mathcal{D}_D \\setminus \\mathcal{Z}^{-1}(f)$ has size at least\n$(1 - \\eta) \\cdot |\\mathcal{D}_D|$. Let $\\ell(D, \\mathcal{D}_D)$ be the largest integer $\\ell$ so that for any subset $\\mathcal{D}' \\subseteq \\mathcal{D}_f$ with\n$|\\mathcal{D}'| \\ge |\\mathcal{D}_f|/\\ell$, the average correlation is $|\\rho(\\mathcal{D}', D)| < \\gamma$ (if there is no such $\\ell$ one can take\n$\\ell = 0$). The statistical dimension with average correlation $\\gamma$ and solution set bound $\\eta$ is\ndefined to be the largest $\\ell(D, \\mathcal{D}_D)$ for valid pairs $(D, \\mathcal{D}_D)$ as described, and is denoted by\n$\\mathrm{SDA}(\\mathcal{Z}, \\gamma, \\eta)$.\nTheorem 4.3 ([1]). Let $\\mathcal{X}$ be a domain and $\\mathcal{Z}$ a search problem over a set of solutions $\\mathcal{F}$\nand a class of distributions $\\mathcal{D}$ over $\\mathcal{X}$. For $\\gamma > 0$ and $\\eta \\in (0, 1)$, let $\\ell = \\mathrm{SDA}(\\mathcal{Z}, \\gamma, \\eta)$. Any\n(possibly randomized) unbiased statistical algorithm that solves $\\mathcal{Z}$ with probability $\\delta$ requires\nat least $m$ calls to the Unbiased Oracle for\n$m = \\min\\left\\{ \\frac{\\ell(\\delta - \\eta)}{2(1 - \\eta)}, \\frac{(\\delta - \\eta)^2}{12\\gamma} \\right\\}.$\nIn particular, if $\\eta \\le 1/6$, then any algorithm with success probability at least 2/3 requires at\nleast $\\min\\{\\ell/4, 1/(48\\gamma)\\}$ samples from the Unbiased Oracle.\nIn order to show that a graphical model on $p$ nodes of maximum degree $d$ requires\ncomputation $p^{\\Omega(d)}$ in this computational model, we therefore would like to show that\n$\\mathrm{SDA}(\\mathcal{Z}, \\gamma, \\eta) = p^{\\Omega(d)}$ with $\\gamma = p^{-\\Omega(d)}$.\n4.2 Soft parities\n\n
For any subset $S \\subset [p]$ of cardinality $|S| = d$, let $\\chi_S(x) = \\prod_{i \\in S} x_i$ be the parity of variables\nin $S$. Define a probability distribution by assigning mass to $x \\in \\{-1, +1\\}^p$ according to\n$p_S(x) = \\frac{1}{Z} \\exp(c \\cdot \\chi_S(x)).$ (4.2)\nHere $c$ is a constant, and the partition function is\n$Z = \\sum_x \\exp(c \\cdot \\chi_S(x)) = 2^{p-1}(e^c + e^{-c}).$ (4.3)\nOur family of distributions $\\mathcal{D}$ is given by these soft parities over subsets $S \\subset [p]$, and $|\\mathcal{D}| = \\binom{p}{d}$. The following lemma, proved in the supplementary material, computes correlations\nbetween distributions.\nLemma 4.4. Let $U$ denote the uniform distribution on $\\{-1, +1\\}^p$. For $S \\ne T$, the correlation\n$\\langle \\frac{p_S}{U} - 1, \\frac{p_T}{U} - 1 \\rangle$ is exactly equal to zero for any value of $c$. 
If $S = T$, the correlation\n$\\langle \\frac{p_S}{U} - 1, \\frac{p_S}{U} - 1 \\rangle = 1 - \\frac{4}{(e^c + e^{-c})^2} \\le 1$.\nLemma 4.5. For any set $\\mathcal{D}' \\subseteq \\mathcal{D}$ of size at least $|\\mathcal{D}|/p^{d/2}$, the average correlation satisfies\n$\\rho(\\mathcal{D}', U) \\le d^d p^{-d/2}$.\nProof. By the preceding lemma, the only contributions to the sum (4.1) come from choosing\nthe same set $S$ in the sum, of which there are a fraction $1/|\\mathcal{D}'|$. Each such correlation is at\nmost one by Lemma 4.4, so $\\rho \\le 1/|\\mathcal{D}'| \\le p^{d/2}/|\\mathcal{D}| = p^{d/2}/\\binom{p}{d} \\le d^d/p^{d/2}$. Here we used the\nestimate $\\binom{n}{k} \\ge (n/k)^k$.\nProof of Theorem 1.1. Let $\\eta = 1/6$ and $\\gamma = d^d p^{-d/2}$, and consider the set of distributions\n$\\mathcal{D}$ given by soft parities as defined above. With reference distribution $D = U$, the uniform\ndistribution, Lemma 4.5 implies that $\\mathrm{SDA}(\\mathcal{Z}, \\gamma, \\eta)$ of the structure learning problem over\ndistribution (4.2) is at least $\\ell = p^{d/2}/d^d$. The result follows from Theorem 4.3.\n\nAcknowledgments\nThis work was supported in part by NSF grants CMMI-1335155 and CNS-1161964, and by\nArmy Research Office MURI Award W911NF-11-1-0036.\n\nReferences\n[1] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao, \u201cStatistical algorithms and a lower bound for detecting planted cliques,\u201d in STOC, pp. 655\u2013664, ACM, 2013.\n\n[2] G. Bresler, E. Mossel, and A. Sly, \u201cReconstruction of Markov random fields from samples: Some observations and algorithms,\u201d Approximation, Randomization and Combinatorial Optimization, pp. 343\u2013356, 2008.\n\n[3] J. Bento and A. Montanari, \u201cWhich graphical models are difficult to learn?,\u201d in NIPS, 2009.\n\n[4] P. Abbeel, D. Koller, and A. Y. Ng, \u201cLearning factor graphs in polynomial time and sample complexity,\u201d The Journal of Machine Learning Research, vol. 7, pp. 1743\u20131788, 2006.\n\n[5] I. Csisz\u00e1r and Z. 
Talata, \u201cConsistent estimation of the basic neighborhood of Markov random fields,\u201d The Annals of Statistics, pp. 123\u2013145, 2006.\n\n[6] N. P. Santhanam and M. J. Wainwright, \u201cInformation-theoretic limits of selecting binary graphical models in high dimensions,\u201d Info. Theory, IEEE Trans. on, vol. 58, no. 7, pp. 4117\u20134134, 2012.\n\n[7] M. Kearns, \u201cEfficient noise-tolerant learning from statistical queries,\u201d Journal of the ACM (JACM), vol. 45, no. 6, pp. 983\u20131006, 1998.\n\n[8] C. Chow and C. Liu, \u201cApproximating discrete probability distributions with dependence trees,\u201d Information Theory, IEEE Transactions on, vol. 14, no. 3, pp. 462\u2013467, 1968.\n\n[9] S. Dasgupta, \u201cLearning polytrees,\u201d in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pp. 134\u2013141, Morgan Kaufmann Publishers Inc., 1999.\n\n[10] N. Srebro, \u201cMaximum likelihood bounded tree-width Markov networks,\u201d in Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 504\u2013511, Morgan Kaufmann Publishers Inc., 2001.\n\n[11] R. L. Dobrushin, \u201cPrescribing a system of random variables by conditional distributions,\u201d Theory of Probability & Its Applications, vol. 15, no. 3, pp. 458\u2013486, 1970.\n\n[12] R. L. Dobrushin and S. B. Shlosman, \u201cConstructive criterion for the uniqueness of Gibbs field,\u201d in Statistical physics and dynamical systems, pp. 347\u2013370, Springer, 1985.\n\n[13] J. Salas and A. D. Sokal, \u201cAbsence of phase transition for antiferromagnetic Potts models via the Dobrushin uniqueness theorem,\u201d Journal of Statistical Physics, vol. 86, no. 3-4, pp. 551\u2013579, 1997.\n\n[14] D. Gamarnik, D. A. Goldberg, and T. Weber, \u201cCorrelation decay in random decision networks,\u201d Mathematics of Operations Research, vol. 39, no. 2, pp. 229\u2013261, 2013.\n\n[15] D. Gamarnik and D. 
Katz, \u201cCorrelation decay and deterministic FPTAS for counting list-colorings of a graph,\u201d in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1245\u20131254, Society for Industrial and Applied Mathematics, 2007.\n\n[16] D. Weitz, \u201cCounting independent sets up to the tree threshold,\u201d in Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp. 140\u2013149, ACM, 2006.\n\n[17] P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai, \u201cGreedy learning of Markov network structure,\u201d in 48th Allerton Conference, pp. 1295\u20131302, 2010.\n\n[18] A. Ray, S. Sanghavi, and S. Shakkottai, \u201cGreedy learning of graphical models with small girth,\u201d in 50th Allerton Conference, 2012.\n\n[19] A. Anandkumar, V. Tan, F. Huang, and A. Willsky, \u201cHigh-dimensional structure estimation in Ising models: Local separation criterion,\u201d Ann. of Stat., vol. 40, no. 3, pp. 1346\u20131375, 2012.\n\n[20] R. Wu, R. Srikant, and J. Ni, \u201cLearning loosely connected Markov random fields,\u201d Stochastic Systems, vol. 3, no. 2, pp. 362\u2013404, 2013.\n\n[21] P. Ravikumar, M. Wainwright, and J. Lafferty, \u201cHigh-dimensional Ising model selection using $\\ell_1$-regularized logistic regression,\u201d The Annals of Statistics, vol. 38, no. 3, pp. 1287\u20131319, 2010.\n\n[22] S.-I. Lee, V. Ganapathi, and D. Koller, \u201cEfficient structure learning of Markov networks using $l_1$-regularization,\u201d in Advances in Neural Information Processing Systems, pp. 817\u2013824, 2006.\n\n[23] A. Jalali, C. C. Johnson, and P. D. Ravikumar, \u201cOn learning discrete graphical models using greedy methods,\u201d in NIPS, pp. 1935\u20131943, 2011.\n\n[24] A. Jalali, P. Ravikumar, V. Vasuki, S. Sanghavi, and U. ECE, \u201cOn learning discrete graphical models using group-sparse regularization,\u201d in Inter. Conf. on AI and Statistics (AISTATS), vol. 14, 2011.\n\n[25] A. Sinclair, P. 
Srivastava, and M. Thurley, \u201cApproximation algorithms for two-state anti-ferromagnetic spin systems on bounded degree graphs,\u201d Journal of Statistical Physics, vol. 155, no. 4, pp. 666\u2013686, 2014.\n", "award": [], "sourceid": 1480, "authors": [{"given_name": "Guy", "family_name": "Bresler", "institution": "Massachusetts Institute of Technology"}, {"given_name": "David", "family_name": "Gamarnik", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Devavrat", "family_name": "Shah", "institution": "Massachusetts Institute of Technology"}]}