{"title": "On the Information Theoretic Limits of Learning Ising Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2303, "page_last": 2311, "abstract": "We provide a general framework for computing lower-bounds on the sample complexity of recovering the underlying graphs of Ising models, given i.i.d. samples. While there have been recent results for specific graph classes, these involve fairly extensive technical arguments that are specialized to each specific graph class. In contrast, we isolate two key graph-structural ingredients that can then be used to specify sample complexity lower-bounds. Presence of these structural properties makes the graph class hard to learn. We derive corollaries of our main result that not only recover existing recent results, but also provide lower bounds for novel graph classes not considered previously. We also extend our framework to the random graph setting and derive corollaries for Erdos-Renyi graphs in a certain dense setting.", "full_text": "On the Information Theoretic Limits\n\nof Learning Ising Models\n\nKarthikeyan Shanmugam1\u2217, Rashish Tandon2\u2020, Alexandros G. Dimakis1\u2021, Pradeep Ravikumar2(cid:63)\n\n1Department of Electrical and Computer Engineering, 2Department of Computer Science\n\nThe University of Texas at Austin, USA\n\n\u2217karthiksh@utexas.edu, \u2020rashish@cs.utexas.edu\n\n\u2021dimakis@austin.utexas.edu, (cid:63)pradeepr@cs.utexas.edu\n\nAbstract\n\nWe provide a general framework for computing lower-bounds on the sample com-\nplexity of recovering the underlying graphs of Ising models, given i.i.d. samples.\nWhile there have been recent results for speci\ufb01c graph classes, these involve fairly\nextensive technical arguments that are specialized to each speci\ufb01c graph class. In\ncontrast, we isolate two key graph-structural ingredients that can then be used to\nspecify sample complexity lower-bounds. 
The presence of these structural properties makes the graph class hard to learn. We derive corollaries of our main result that not only recover existing recent results, but also provide lower bounds for novel graph classes not considered previously. We also extend our framework to the random graph setting and derive corollaries for Erdős-Rényi graphs in a certain dense setting.

1 Introduction

Graphical models provide compact representations of multivariate distributions using graphs that represent Markov conditional independencies in the distribution. They are thus widely used in a number of machine learning domains where there are a large number of random variables, including natural language processing [13], image processing [6, 10, 19], statistical physics [11], and spatial statistics [15]. In many of these domains, a key problem of interest is to recover the underlying dependencies, represented by the graph, given samples, i.e., to estimate the graph of dependencies given instances drawn from the distribution. A common regime where this graph selection problem is of interest is the high-dimensional setting, where the number of samples n is potentially smaller than the number of variables p. Given the importance of this problem, it is instructive to have lower bounds on the sample complexity of any estimator: it clarifies the statistical difficulty of the underlying problem, and moreover it could serve as a certificate of optimality in terms of sample complexity for any estimator that actually achieves this lower bound. We are particularly interested in such lower bounds under the structural constraint that the graph lies within a given class of graphs (such as degree-bounded graphs, bounded-girth graphs, and so on).

The simplest approach to obtaining such bounds involves graph counting arguments, and an application of Fano's lemma. 
[2, 17] for instance derive such bounds for the case of degree-bounded and power-law graph classes respectively. This approach however is purely graph-theoretic, and thus fails to capture the interaction of the graphical model parameters with the graph structural constraints, and thus typically provides suboptimal lower bounds (as also observed in [16]). The other standard approach requires a more complicated argument through Fano's lemma that requires finding a subset of graphs such that (a) the subset is large enough in number, and (b) the graphs in the subset are close enough in a suitable metric, typically the KL-divergence of the corresponding distributions. This approach is however much more technically intensive, and even for the simple classes of bounded degree and bounded edge graphs for Ising models, [16] required fairly extensive arguments in using the above approach to provide lower bounds.

In modern high-dimensional settings, it is becoming increasingly important to incorporate structural constraints in statistical estimation, and graph classes are a key interpretable structural constraint. But a new graph class would entail an entirely new (and technically intensive) derivation of the corresponding sample complexity lower bounds. In this paper, we are thus interested in isolating the key ingredients required in computing such lower bounds. This key ingredient involves one of the following structural characterizations: (1) connectivity by short paths between pairs of nodes, or (2) existence of many graphs that only differ by an edge. As corollaries of this framework, we not only recover the results in [16] for the simple cases of degree- and edge-bounded graphs, but also obtain lower bounds for several more classes of graphs, for which achievability results have already been proposed [1]. Moreover, using structural arguments allows us to bring out the dependence of the sample complexity on the edge weights, λ. 
We are able to show the same sample complexity requirements for d-regular graphs as for degree-d-bounded graphs, although the former class is much smaller. We also extend our framework to the random graph setting and, as a corollary, establish lower bound requirements for the class of Erdős-Rényi graphs in a dense setting. Here, we show that under a certain scaling of the edge weights λ, G(p, c/p) requires exponentially many samples, as opposed to the polynomial requirement suggested by earlier bounds [1].

2 Preliminaries and Definitions

Notation: R represents the real line. [p] denotes the set of integers from 1 to p. Let $1_S$ denote the vector of ones and zeros where S is the set of coordinates containing 1. Let A − B denote $A \cap B^c$ and A∆B denote the symmetric difference of two sets A and B.

In this work, we consider the problem of learning the graph structure of an Ising model. Ising models are a class of graphical model distributions over binary vectors, characterized by the pair $(G(V,E), \bar\theta)$, where G(V,E) is an undirected graph on p vertices and $\bar\theta \in \mathbb{R}^{\binom{p}{2}}$ with $\bar\theta_{i,j} = 0$ for all $(i,j) \notin E$ and $\bar\theta_{i,j} \neq 0$ for all $(i,j) \in E$. Let $\mathcal{X} = \{+1,-1\}$. Then, for the pair $(G, \bar\theta)$, the distribution on $\mathcal{X}^p$ is given as: $f_{G,\bar\theta}(x) = \frac{1}{Z}\exp\left(\sum_{i,j} \bar\theta_{i,j}\, x_i x_j\right)$, where $x \in \mathcal{X}^p$ and Z is the normalization factor, also known as the partition function.

Thus, we obtain a family of distributions by considering a set of edge-weighted graphs $\mathcal{G}_\theta$, where each element of $\mathcal{G}_\theta$ is a pair $(G, \bar\theta)$. In other words, every member of the class $\mathcal{G}_\theta$ is a weighted undirected graph. 
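To make the definition above concrete, here is a minimal brute-force sketch (our illustration, not part of the paper) that materializes the zero-field Ising distribution $f_{G,\lambda}(x) = \frac{1}{Z}\exp(\sum_{(i,j)\in E} \lambda x_i x_j)$ and its partition function Z by enumerating all $2^p$ sign vectors. The function name and example graph are ours; this is feasible only for small p.

```python
from itertools import product
from math import exp

def ising_distribution(p, edges, lam):
    """Return a dict mapping each x in {+1,-1}^p to f_{G,lam}(x),
    computed by exhaustive enumeration (exponential in p)."""
    weights = {}
    for x in product((1, -1), repeat=p):
        # unnormalized weight: same scalar weight lam on every edge of G
        s = sum(lam * x[i] * x[j] for (i, j) in edges)
        weights[x] = exp(s)
    Z = sum(weights.values())  # partition function
    return {x: w / Z for x, w in weights.items()}

# Example: a triangle on 3 vertices with lam = 0.5
f = ising_distribution(3, [(0, 1), (1, 2), (0, 2)], 0.5)
assert abs(sum(f.values()) - 1.0) < 1e-12   # a valid distribution
assert f[(1, 1, 1)] > f[(1, 1, -1)]         # aligned spins are more probable
```

By symmetry of the zero-field model, $f(x) = f(-x)$, so the two fully-aligned configurations share the largest probability.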
Let $\mathcal{G}$ denote the set of distinct unweighted graphs in the class $\mathcal{G}_\theta$.

A learning algorithm that learns the graph G (and not the weights $\bar\theta$) from n independent samples (each sample is a p-dimensional binary vector) drawn from the distribution $f_{G,\bar\theta}(\cdot)$, is an efficiently computable map $\phi : \mathcal{X}^{np} \to \mathcal{G}$ which maps the input samples $\{x^1, \ldots, x^n\}$ to an undirected graph $\hat{G} \in \mathcal{G}$, i.e. $\hat{G} = \phi(x^1, \ldots, x^n)$.

We now discuss two metrics of reliability for such an estimator $\phi$. For a given $(G, \bar\theta)$, the probability of error (over the samples drawn) is given by $p(G, \bar\theta) = \Pr(\hat{G} \neq G)$. Given a graph class $\mathcal{G}_\theta$, one may consider the maximum probability of error for the map $\phi$, given as:

$p_{max} = \max_{(G,\bar\theta) \in \mathcal{G}_\theta} \Pr(\hat{G} \neq G)$   (1)

The goal of any estimator $\phi$ would be to achieve as low a $p_{max}$ as possible. Alternatively, there are random graph classes that come naturally endowed with a probability measure $\mu(G, \bar\theta)$ of choosing the graphical model. In this case, the quantity we would want to minimize would be the average probability of error of the map $\phi$, given as:

$p_{avg} = \mathbb{E}_\mu\left[\Pr(\hat{G} \neq G)\right]$   (2)

In this work, we are interested in answering the following question: For any estimator $\phi$, what is the minimum number of samples n needed to guarantee an asymptotically small $p_{max}$ or $p_{avg}$? The answer depends on $\mathcal{G}_\theta$ and $\mu$ (when applicable).

For the sake of simplicity, we impose the following restrictions¹: We restrict to the set of zero-field ferromagnetic Ising models, where zero-field refers to a lack of node weights, and ferromagnetic refers to all positive edge weights. 
Further, we will restrict all the non-zero edge weights $(\bar\theta_{i,j})$ in the graph classes to be the same, set equal to λ > 0. Therefore, for a given G(V,E), we have $\bar\theta = \lambda 1_E$ for some λ > 0. A deterministic graph class is described by a scalar λ > 0 and the family of graphs $\mathcal{G}$. In the case of a random graph class, we describe it by a scalar λ > 0 and a probability measure µ, the measure being solely on the structure of the graph G (on $\mathcal{G}$).

Since we have the same weight λ (> 0) on all edges, henceforth we will skip the reference to it, i.e. the graph class will simply be denoted $\mathcal{G}$ and, for a given $G \in \mathcal{G}$, the distribution will be denoted by $f_G(\cdot)$, with the dependence on λ being implicit. Before proceeding further, we summarize the following additional notation. For any two distributions $f_G$ and $f_{G'}$, corresponding to the graphs G and G′ respectively, we denote the Kullback-Leibler divergence (KL-divergence) between them as $D(f_G \| f_{G'}) = \sum_{x \in \mathcal{X}^p} f_G(x) \log\frac{f_G(x)}{f_{G'}(x)}$. For any subset $T \subseteq \mathcal{G}$, we let $C_T(\epsilon)$ denote an ε-covering w.r.t. the KL-divergence (of the corresponding distributions), i.e. $C_T(\epsilon)\,(\subseteq \mathcal{G})$ is a set of graphs such that for any $G \in T$, there exists a $G' \in C_T(\epsilon)$ satisfying $D(f_G \| f_{G'}) \le \epsilon$. We denote the entropy of any r.v. X by H(X), and the mutual information between any two r.v.s X and Y by I(X;Y). The rest of the paper is organized as follows. Section 3 describes Fano's lemma, a basic tool employed in computing information-theoretic lower bounds. Section 4 identifies key structural properties that lead to large sample requirements. Section 5 applies the results of Sections 3 and 4 on a number of different deterministic graph classes to obtain lower bound estimates. Section 6 obtains lower bound estimates for Erdős-Rényi random graphs in a dense regime. 
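For small p, the KL-divergence between two such distributions can be computed exactly by enumeration. The sketch below (our addition; the helper names and toy graphs are ours) evaluates $D(f_G \| f_{G'})$ directly from the definition, for two graphs with the same edge weight λ.

```python
from itertools import product
from math import exp, log

def ising(p, edges, lam):
    """Zero-field Ising distribution with uniform edge weight lam (brute force)."""
    w = {x: exp(sum(lam * x[i] * x[j] for i, j in edges))
         for x in product((1, -1), repeat=p)}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def kl(f, g):
    """D(f || g) = sum_x f(x) log(f(x)/g(x)); both supports are all of {+1,-1}^p."""
    return sum(fx * log(fx / g[x]) for x, fx in f.items())

p, lam = 4, 0.4
f_G  = ising(p, [(0, 1), (1, 2), (2, 3), (0, 3)], lam)  # 4-cycle
f_Gp = ising(p, [(0, 1), (1, 2), (2, 3)], lam)          # one edge removed
assert kl(f_G, f_G) == 0.0   # zero on identical distributions
assert kl(f_G, f_Gp) > 0.0   # strictly positive when the graphs differ
```

This is the quantity the coverings $C_T(\epsilon)$ are measured in; the asymmetry of KL is why the paper bounds it by the symmetric divergence later on.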
All proofs can be found in the Appendix (see supplementary material).

3 Fano's Lemma and Variants

Fano's lemma [5] is a primary tool for obtaining bounds on the average probability of error, $p_{avg}$. It provides a lower bound on the probability of error of any estimator $\phi$ in terms of the entropy H(·) of the output space, the cardinality of the output space, and the mutual information I(·,·) between the input and the output. The case of $p_{max}$ is interesting only when we have a deterministic graph class $\mathcal{G}$, and can be handled through Fano's lemma again by considering a uniform distribution on the graph class.

Lemma 1 (Fano's Lemma). Consider a graph class $\mathcal{G}$ with measure µ. Let $G \sim \mu$, and let $X^n = \{x^1, \ldots, x^n\}$ be n independent samples such that $x^i \sim f_G$, $i \in [p]$... wait, $i \in [n]$. Then, for $p_{max}$ and $p_{avg}$ as defined in (1) and (2) respectively,

$p_{max} \ge p_{avg} \ge \frac{H(G) - I(G;X^n) - \log 2}{\log|\mathcal{G}|}$   (3)

Thus in order to use this lemma, we need to bound two quantities: the entropy H(G), and the mutual information $I(G;X^n)$. The entropy can typically be obtained or bounded very simply; for instance, with a uniform distribution over the set of graphs $\mathcal{G}$, $H(G) = \log|\mathcal{G}|$. The mutual information is a much trickier object to bound however, and is where the technical complexity largely arises. We can however simply obtain the following loose bound: $I(G;X^n) \le H(X^n) \le np$. We thus arrive at the following corollary:

Corollary 1. Consider a graph class $\mathcal{G}$. Then, $p_{max} \ge 1 - \frac{np + \log 2}{\log|\mathcal{G}|}$.

Remark 1. From Corollary 1, we get: If $n \le \frac{\log|\mathcal{G}|}{p}\left((1-\delta) - \frac{\log 2}{\log|\mathcal{G}|}\right)$, then $p_{max} \ge \delta$. Note that this bound on n is only in terms of the cardinality of the graph class $\mathcal{G}$, and therefore would not involve any dependence on λ (and consequently, be very loose).

To obtain sharper lower bound guarantees that depend on graphical model parameters, it is useful to consider instead a conditional form of Fano's lemma [1, Lemma 9], which allows us to obtain lower bounds on $p_{avg}$ in terms of conditional analogs of the quantities in Lemma 1. For the case of $p_{max}$, these conditional analogs correspond to uniform measures on subsets of the original class $\mathcal{G}$.

¹Note that a lower bound for a restricted subset of a class of Ising models will also serve as a lower bound for the class without that restriction.

The conditional version allows us to focus on potentially harder-to-learn subsets of the graph class, leading to sharper lower bound guarantees. Also, for a random graph class, the entropy H(G) may be asymptotically much smaller than the log cardinality of the graph class, $\log|\mathcal{G}|$ (e.g. Erdős-Rényi random graphs; see Section 6), rendering the bound in Lemma 1 useless. The conditional version allows us to circumvent this issue by focusing on a high-probability subset of the graph class.

Lemma 2 (Conditional Fano's Lemma). Consider a graph class $\mathcal{G}$ with measure µ. Let $G \sim \mu$, and let $X^n = \{x^1, \ldots, x^n\}$ be n independent samples such that $x^i \sim f_G$, $i \in [n]$. Consider any $T \subseteq \mathcal{G}$ and let $\mu(T)$ be the measure of this subset, i.e. $\mu(T) = \Pr_\mu(G \in T)$. 
Then, we have

$p_{avg} \ge \mu(T)\,\frac{H(G \mid G \in T) - I(G;X^n \mid G \in T) - \log 2}{\log|T|}$, and

$p_{max} \ge \frac{H(G \mid G \in T) - I(G;X^n \mid G \in T) - \log 2}{\log|T|}$.

Given Lemma 2, or even Lemma 1, it is the sharpness of an upper bound on the mutual information that governs the sharpness of lower bounds on the probability of error (and effectively, the number of samples n). In contrast to the trivial upper bound used in the corollary above, we next use a tighter bound from [20], which relates the mutual information to coverings in terms of the KL-divergence, applied to Lemma 2. Note that, as stated earlier, we simply impose a uniform distribution on $\mathcal{G}$ when dealing with $p_{max}$. Analogous bounds can be obtained for $p_{avg}$.

Corollary 2. Consider a graph class $\mathcal{G}$, and any $T \subseteq \mathcal{G}$. Recall the definition of $C_T(\epsilon)$ from Section 2. For any ε > 0, we have $p_{max} \ge 1 - \frac{\log|C_T(\epsilon)| + n\epsilon + \log 2}{\log|T|}$.

Remark 2. From Corollary 2, we get: If $n \le \frac{\log|T|}{\epsilon}\left(\frac{\log|T| - \log|C_T(\epsilon)|}{\log|T|}(1-\delta) - \frac{\log 2}{\log|T|}\right)$, then $p_{max} \ge \delta$. Here ε is an upper bound on the radius of the KL-balls in the covering, and usually varies with λ.

But this corollary cannot be immediately used given a graph class: it requires us to specify a subset T of the overall graph class, the term ε, and the KL-covering $C_T(\epsilon)$.

We can simplify the bound above by setting ε to be the radius of a single KL-ball w.r.t. some center, covering the whole set T. Suppose this radius is ρ; then the size of the covering set is just 1. In this case, from Remark 2, we get: If $n \le \frac{\log|T|}{\rho}\left((1-\delta) - \frac{\log 2}{\log|T|}\right)$, then $p_{max} \ge \delta$. 
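As a purely numeric illustration of how these bounds are evaluated, the sketch below plugs made-up values (ours, not from the paper) into Corollary 1 and into the single-KL-ball form of the bound. For Corollary 1 we take $\mathcal{G}$ to be all graphs on p vertices, so $|\mathcal{G}| = 2^{\binom{p}{2}}$; the subset size $|T|$ and radius ρ in the second computation are hypothetical.

```python
from math import comb, log

p, delta = 10, 0.5
log_G = comb(p, 2) * log(2)   # log |G| for all graphs on p vertices
# Corollary 1 / Remark 1: any estimator given n <= n_cor1 samples has p_max >= delta
n_cor1 = (log_G / p) * ((1 - delta) - log(2) / log_G)

# Single-KL-ball form: hypothetical subset with |T| = 2^20 and KL radius rho
rho, log_T = 0.05, 20 * log(2)
n_ball = (log_T / rho) * ((1 - delta) - log(2) / log_T)

assert n_cor1 > 0 and n_ball > 0
```

Note how loose the graph-counting bound is at this scale (a handful of samples), while a small KL radius ρ drives the single-ball bound up by a factor 1/ρ; this is exactly why the paper hunts for large subsets T of small KL diameter.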
Thus, our goal in the sequel would be to provide a general mechanism to derive such a subset T: one that is large in number and yet has small diameter with respect to KL-divergence.

We note that Fano's lemma and the variants described in this section are standard, and have been applied to a number of problems in statistical estimation [1, 14, 16, 20, 21].

4 Structural conditions governing Correlation

As discussed in the previous section, we want to find subsets T that are large in size, and yet have a small KL-diameter. In this section, we summarize certain structural properties that result in a small KL-diameter. Thereafter, finding a large set T would amount to finding a large number of graphs in the graph class $\mathcal{G}$ that satisfy these structural properties.

As a first step, we need to get a sense of when two graphs would have corresponding distributions with a small KL-divergence. To do so, we need a general upper bound on the KL-divergence between the corresponding distributions. A simple strategy is to bound it by the symmetric divergence [16]. In this case, a little calculation shows:

$D(f_G \| f_{G'}) \le D(f_G \| f_{G'}) + D(f_{G'} \| f_G) = \sum_{(s,t) \in E \setminus E'} \lambda\left(\mathbb{E}_G[x_s x_t] - \mathbb{E}_{G'}[x_s x_t]\right) + \sum_{(s,t) \in E' \setminus E} \lambda\left(\mathbb{E}_{G'}[x_s x_t] - \mathbb{E}_G[x_s x_t]\right)$   (4)

where E and E′ are the edges in the graphs G and G′ respectively, and $\mathbb{E}_G[\cdot]$ denotes the expectation under $f_G$. Also note that the correlation between $x_s$ and $x_t$ satisfies $\mathbb{E}_G[x_s x_t] = 2P_G(x_s x_t = +1) - 1$.

From Eq. (4), we observe that the only pairs (s,t) contributing to the KL-divergence are the ones that lie in the symmetric difference E∆E′. If the number of such pairs is small, and the difference of correlations in G and G′ (i.e. 
$\mathbb{E}_G[x_s x_t] - \mathbb{E}_{G'}[x_s x_t]$) for such pairs is small, then the KL-divergence would be small.

To summarize the setting so far: to obtain a tight lower bound on sample complexity for a class of graphs, we need to find a subset of graphs T with small KL diameter. The key to this is to identify when the KL divergence between (distributions corresponding to) two graphs would be small. And the key to this in turn is to identify when there would be only a small difference in the correlations between a pair of variables across the two graphs G and G′. In the subsequent subsections, we provide two simple and general structural characterizations that achieve such a small difference of correlations across G and G′.

4.1 Structural Characterization with Large Correlation

One scenario when there might be a small difference in correlations is when one of the correlations is very large, specifically arbitrarily close to 1, say $\mathbb{E}_{G'}[x_s x_t] \ge 1 - \epsilon$, for some ε > 0. Then $\mathbb{E}_G[x_s x_t] - \mathbb{E}_{G'}[x_s x_t] \le \epsilon$, since $\mathbb{E}_G[x_s x_t] \le 1$. Indeed, when s, t are part of a clique [16], this is achieved, since the large number of connections between them forces a higher probability of agreement, i.e. $P_G(x_s x_t = +1)$ is large.

In this work we provide a more general characterization of when this might happen by relying on the following key lemma, which connects the presence of "many" node-disjoint "short" paths between a pair of nodes in the graph to high correlation between them. We define the property formally below.

Definition 1. Two nodes a and b in an undirected graph G are said to be (ℓ, d) connected if they have d node-disjoint paths of length at most ℓ.

Lemma 3. Consider a graph G and a scalar λ > 0. Consider the distribution $f_G(x)$ induced by the graph. 
If a pair of nodes a and b are (ℓ, d) connected, then $\mathbb{E}_G[x_a x_b] \ge 1 - \frac{2}{1 + \left(\frac{1+(\tanh\lambda)^\ell}{1-(\tanh\lambda)^\ell}\right)^d}$.

From the above lemma, we can observe that as ℓ gets smaller and d gets larger, $\mathbb{E}_G[x_a x_b]$ approaches its maximum value of 1. As an example, in a k-clique, any two vertices s and t are (2, k−1) connected. In this case, the bound from Lemma 3 gives us: $\mathbb{E}_G[x_a x_b] \ge 1 - \frac{2}{1 + (\cosh 2\lambda)^{k-1}}$ (since $\frac{1+\tanh^2\lambda}{1-\tanh^2\lambda} = \cosh 2\lambda$). Of course, a clique enjoys a lot more connectivity (it is also $\left(3, \frac{k-1}{2}\right)$ connected etc., albeit with node overlaps), which allows for the stronger bound of $\sim 1 - \lambda k e^{\lambda}/e^{\lambda k}$ (see [16])².

Now, as discussed earlier, a high correlation between a pair of nodes contributes a small term to the KL-divergence. This is stated in the following corollary.

Corollary 3. Consider two graphs G(V,E) and G′(V,E′) and a scalar weight λ > 0 such that E − E′ and E′ − E only contain pairs of nodes that are (ℓ, d) connected in graphs G′ and G respectively. Then the KL-divergence between $f_G$ and $f_{G'}$ satisfies $D(f_G \| f_{G'}) \le \frac{2\lambda|E\Delta E'|}{1 + \left(\frac{1+(\tanh\lambda)^\ell}{1-(\tanh\lambda)^\ell}\right)^d}$.

4.2 Structural Characterization with Low Correlation

Another scenario where there might be a small difference in correlations between an edge pair across two graphs is when the graphs themselves are close in Hamming distance, i.e. they differ by only a few edges. This is formalized below for the situation when they differ by only one edge.

Definition 2 (Hamming Distance). Consider two graphs G(V,E) and G′(V,E′). The Hamming distance between the graphs, denoted by H(G,G′), is the number of edges where the two graphs differ, i.e.

$H(G,G') = |\{(s,t) \mid (s,t) \in E\Delta E'\}|$   (5)

Lemma 4. 
Consider two graphs G(V,E) and G′(V,E′) such that H(G,G′) = 1, and $(a,b) \in E$ is the single edge in E∆E′. Then, $\mathbb{E}_{f_G}[x_a x_b] - \mathbb{E}_{f_{G'}}[x_a x_b] \le \tanh\lambda$. Also, the KL-divergence between the distributions satisfies $D(f_G \| f_{G'}) \le \lambda \tanh\lambda$.

²Both the bound from [16] and the bound from Lemma 3 have exponential asymptotic behaviour (i.e. as k grows) for constant λ. For smaller λ, the bound from [16] is strictly better. However, not all graph classes allow for the presence of a large enough clique, e.g., girth-bounded graphs, path-restricted graphs, Erdős-Rényi graphs.

The above bound is useful in low-λ settings. In this regime, $\lambda \tanh\lambda$ roughly behaves as $\lambda^2$. So, a smaller λ would correspond to a smaller KL-divergence.

4.3 Influence of Structure on Sample Complexity

Now, we provide some high-level intuition behind why the structural characterizations above would be useful for lower bounds, beyond the technical reasons underlying Fano's Lemma that we have specified so far. Let us assume that λ > 0 is a positive real constant. In a graph, even when the edge (s,t) is removed, (s,t) being (ℓ, d) connected ensures that the correlation between s and t is still very high (exponentially close to 1). Therefore, resolving the question of the presence/absence of the edge (s,t) would be difficult, requiring lots of samples. This is analogous in principle to the argument in [16] used for establishing hardness of learning of a set of graphs each of which is obtained by removing a single edge from a clique, still ensuring many short paths between any two vertices. Similarly, if the graphs G and G′ are close in Hamming distance, then their corresponding distributions, $f_G$ and $f_{G'}$, also tend to be similar. 
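Both characterizations can be checked numerically on a toy graph. The sketch below (our addition) builds a 5-node graph where nodes 0 and 1 are (2, 3) connected through three node-disjoint length-2 paths, and compares the exact correlation against the Lemma 3 bound $1 - 2/(1 + ((1+\tanh^\ell\lambda)/(1-\tanh^\ell\lambda))^d)$; it then adds the direct edge (0,1) to get a pair of graphs at Hamming distance 1 and checks the Lemma 4 bounds. For this bare disjoint-paths graph the Lemma 3 bound is in fact tight, so we allow a tiny floating-point slack there.

```python
from itertools import product
from math import exp, log, tanh

def ising(p, edges, lam):
    w = {x: exp(sum(lam * x[i] * x[j] for i, j in edges))
         for x in product((1, -1), repeat=p)}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def corr(f, a, b):
    return sum(fx * x[a] * x[b] for x, fx in f.items())

def kl(f, g):
    return sum(fx * log(fx / g[x]) for x, fx in f.items())

lam = 0.6
# nodes 0 and 1 are (2,3) connected: three node-disjoint 2-paths via 2, 3, 4
paths = [(0, 2), (2, 1), (0, 3), (3, 1), (0, 4), (4, 1)]
f_without = ising(5, paths, lam)             # G': the paths only
f_with = ising(5, paths + [(0, 1)], lam)     # G : paths plus the edge (0,1)

t = tanh(lam)
lemma3 = 1 - 2 / (1 + ((1 + t**2) / (1 - t**2)) ** 3)
assert corr(f_with, 0, 1) >= lemma3             # high correlation with the edge
assert corr(f_without, 0, 1) >= lemma3 - 1e-9   # tight without it

# Lemma 4: the two graphs differ in exactly one edge (Hamming distance 1)
assert corr(f_with, 0, 1) - corr(f_without, 0, 1) <= t
assert kl(f_with, f_without) <= lam * t
```

Even with the edge (0,1) absent, the correlation between 0 and 1 stays above 0.71 here, which is the intuition for why the presence of that edge is hard to resolve from samples.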
Again, it becomes difficult to tease apart which distribution the observed samples may have originated from.

5 Application to Deterministic Graph Classes

In this section, we provide lower bound estimates for a number of deterministic graph families. This is done by explicitly finding a subset T of the graph class $\mathcal{G}$, based on the structural properties of the previous section. See the supplementary material for details of these constructions. A common underlying theme to all is the following: We try to find a graph in $\mathcal{G}$ containing many edge pairs (u,v) such that their end vertices, u and v, have many paths between them (possibly node-disjoint). Once we have such a graph, we construct a subset T by removing one of the edges for these well-connected edge pairs. This ensures that the new graphs differ from the original in only the well-connected pairs. Alternatively, by removing any edge (and not just well-connected pairs) we can get another, larger family T which is 1-Hamming away from the original graph.

5.1 Path Restricted Graphs

Let $\mathcal{G}_{p,\eta}$ be the class of all graphs on p vertices with at most η paths (η = o(p)) between any two vertices. We have the following theorem:

Theorem 1. For the class $\mathcal{G}_{p,\eta}$, if $n \le (1-\delta)\max\left\{\frac{\log(p/2)}{\lambda\tanh\lambda},\; \frac{1+\cosh(2\lambda)^{\eta-1}}{2\lambda}\log\left(\frac{p}{\eta}\right)\right\}$, then $p_{max} \ge \delta$.

To understand the scaling, it is useful to think of cosh(2λ) as roughly exponential in $\lambda^2$, i.e. $\cosh(2\lambda) \sim e^{\Theta(\lambda^2)}$³. In this case, from the second term, we need $n \sim \Omega\left(\frac{e^{\lambda^2\eta}}{\lambda}\log\left(\frac{p}{\eta}\right)\right)$ samples. If η is scaling with p, this can be prohibitively large (exponential in $\lambda^2\eta$). Thus, to have low sample complexity, we must enforce $\lambda = O(1/\sqrt{\eta})$. In this case, the first term gives $n = \Omega(\eta\log p)$, since $\lambda\tanh\lambda \sim \lambda^2$ for small λ.

We may also consider a generalization of $\mathcal{G}_{p,\eta}$. Let $\mathcal{G}_{p,\eta,\gamma}$ be the set of all graphs on p vertices such that there are at most η paths of length at most γ between any two nodes (with η + γ = o(p)). Note that there may be more paths of length > γ.

Theorem 2. Consider the graph class $\mathcal{G}_{p,\eta,\gamma}$. For any ν ∈ (0,1), let $t_\nu = \frac{p^{1-\nu} - (\eta+1)}{2(\eta+1)}$. If $n \le (1-\delta)\max\left\{\frac{\log(p/2)}{\lambda\tanh\lambda},\; \frac{1 + \left(\frac{1+\tanh(\lambda)^{\gamma+1}}{1-\tanh(\lambda)^{\gamma+1}}\right)^{t_\nu}}{2\lambda}\,\nu\log p\right\}$, then $p_{max} \ge \delta$.

The parameter ν ∈ (0,1) in the bound above may be adjusted based on the scaling of η and γ. Also, an approximate way to think of the scaling of $\frac{1+\tanh(\lambda)^{\gamma+1}}{1-\tanh(\lambda)^{\gamma+1}}$ is $\sim e^{\lambda^{\gamma+1}}$. As an example, for constant η and γ, we may choose ν = 1/2. In this case, for some constant c, our bound imposes $n \sim \Omega\left(\max\left\{\frac{\log p}{\lambda\tanh\lambda},\; \frac{e^{c\lambda^{\gamma+1}\sqrt{p}}}{\lambda}\log p\right\}\right)$. Now, same as earlier, to have low sample complexity, we must have $\lambda = O(1/p^{1/2(\gamma+1)})$, in which case we get an $n \sim \Omega(p^{1/(\gamma+1)}\log p)$ sample requirement from the first term.

³In fact, for λ ≤ 3, we have $e^{\lambda^2/2} \le \cosh(2\lambda) \le e^{2\lambda^2}$. For λ > 3, cosh(2λ) > 200.

We note that the family $\mathcal{G}_{p,\eta,\gamma}$ is also studied in [1], for which an algorithm is proposed. 
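Membership in $\mathcal{G}_{p,\eta}$ or $\mathcal{G}_{p,\eta,\gamma}$ amounts to bounding the number of (short) simple paths between every pair of vertices. The following brute-force DFS sketch (our addition; exponential-time in general, fine for small illustrative graphs) counts such paths, with an optional length cap matching the γ parameter.

```python
def count_paths(adj, a, b, max_len=None):
    """Count simple paths from a to b, optionally only those of length <= max_len.
    adj: dict mapping each vertex to a list of neighbours."""
    def dfs(u, visited, length):
        if u == b:
            return 1
        if max_len is not None and length == max_len:
            return 0
        return sum(dfs(v, visited | {v}, length + 1)
                   for v in adj[u] if v not in visited)
    return dfs(a, {a}, 0)

# 4-cycle: exactly two paths between opposite vertices, both of length 2
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
assert count_paths(cycle, 0, 2) == 2
assert count_paths(cycle, 0, 2, max_len=1) == 0  # no direct edge
```

So the 4-cycle belongs to $\mathcal{G}_{p,\eta}$ for any η ≥ 2, and its opposite-vertex pairs are (2, 2) connected in the sense of Definition 1.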
Under certain assumptions in [1], and the restrictions that η = O(1) and γ is large enough, the algorithm in [1] requires $\frac{\log p}{\lambda^2}$ samples, which is matched by the first term in our lower bound. Therefore, the algorithm in [1] is optimal for the setting considered.

5.2 Girth Bounded Graphs

The girth of a graph is defined as the length of its shortest cycle. Let $\mathcal{G}_{p,g,d}$ be the set of all graphs with girth at least g and maximum degree d. Note that as girth increases the learning problem becomes easier, with the extreme case of g = ∞ (i.e. trees) being solved by the well-known Chow-Liu algorithm [3] in O(log p) samples. We have the following theorem:

Theorem 3. Consider the graph class $\mathcal{G}_{p,g,d}$. For any ν ∈ (0,1), let $d_\nu = \min\left(d,\, p^{\frac{1-\nu}{g}}\right)$. If $n \le (1-\delta)\max\left\{\frac{\log(p/2)}{\lambda\tanh\lambda},\; \frac{1 + \left(\frac{1+\tanh(\lambda)^{g-1}}{1-\tanh(\lambda)^{g-1}}\right)^{d_\nu}}{2\lambda}\,\nu\log p\right\}$, then $p_{max} \ge \delta$.

5.3 Approximate d-Regular Graphs

Let $\mathcal{G}^{approx}_{p,d}$ be the set of all graphs whose vertices have degree d or degree d − 1. Note that this class is a subset of the class of graphs with degree at most d. We have:

Theorem 4. Consider the class $\mathcal{G}^{approx}_{p,d}$. If $n \le (1-\delta)\max\left\{\frac{\log(pd/4)}{\lambda\tanh\lambda},\; \frac{e^{\lambda d}}{2\lambda d e^{\lambda}}\log\left(\frac{pd}{4}\right)\right\}$, then $p_{max} \ge \delta$.

Note that the second term in the bound above is from [16]. Now, restricting λ to prevent exponential growth in the number of samples, we get a sample requirement of $n = \Omega(d^2\log p)$. This matches the lower bound for degree-d-bounded graphs in [16]. However, note that Theorem 4 is stronger in the sense that the bound holds for a smaller class of graphs, i.e. 
only approximately d-regular, and\nnot d-bounded.\n5.4 Approximate Edge Bounded Graphs\nLet Gapprox\nof graphs with edges at most k. Here, we have:\nTheorem 5. Consider the class Gapprox\n\nbe the set of all graphs with number of edges \u2208(cid:2) k\n(cid:26) log( k\n\n2 , k(cid:3). This class is a subset of the class\n\n, and let k \u2265 9. If we have number of samples n \u2264 (1 \u2212\n, then pmax \u2265 \u03b4.\n\nlog(cid:0) k\n\n(cid:1)(cid:27)\n\n\u221a\n2k\u22121)\n\u221a\n\n\u03b4) max\n\np,k\n\np,k\n\n2 )\n\u03bb tanh \u03bb ,\n\ne\u03bb(\n2\u03bbe\u03bb(\n\n2k+1)\n\n2\n\nNote that the second term in the bound above is from [16]. If we restrict \u03bb to prevent exponential\ngrowth in the number of samples, we get a sample requirement of n = \u2126(k log k). Again, we match\nthe lower bound for the edge bounded class in [16], but through a smaller class.\n6 Erd\u02ddos-R\u00e9nyi graphs G(p, c/p)\nIn this section, we relate the number of samples required to learn G \u223c G(p, c/p) for the dense case,\nfor guaranteeing a constant average probability of error pavg. We have the following main result\nwhose proof can be found in the Appendix.\nTheorem 6. Let G \u223c G(p, c/p), c = \u2126(p3/4 + \u0001(cid:48)), \u0001(cid:48) > 0. For this class of random graphs, if\npavg \u2264 1/90, then n \u2265 max (n1, n2) where:\n\nn1 =\n\n\uf8eb\uf8ed 4\u03bbp\n\nH(c/p)(3/80) (1 \u2212 80pavg \u2212 O(1/p))\n\n3 exp(\u2212 p\n\n36 ) + 4 exp(\u2212 p\n\n3\n2\n\n144 ) +\n\n\uf8f6\uf8f8 , n2 =\n\nH(c/p)(1 \u2212 3pavg) \u2212 O(1/p)\n\np\n4\n\n(6)\n\n(cid:18)\n\n9\n\n4\u03bb\n\n1+(cosh(2\u03bb))\n\n(cid:19)\n\nc2\n6p\n\n7\n\n\fRemark 3. In the denominator of the \ufb01rst expression, the dominating term is\n\n(cid:18)\n\n9\n\n4\u03bb\n\n1+(cosh(2\u03bb))\n\n(cid:19) .\n\nc2\n6p\n\nTherefore, we have the following corollary.\nCorollary 4. Let G \u223c G(p, c/p), c = \u2126(p3/4+\u0001(cid:48)\n\n) for any \u0001(cid:48) > 0. 
Let $p_{\mathrm{avg}} \le 1/90$. Then:

1. $\lambda = \Omega(\sqrt{p}/c)$: $\Omega\left(\lambda H(c/p)\,(\cosh 2\lambda)^{\frac{c^2}{6p}}\right)$ samples are needed.

2. $\lambda < O(\sqrt{p}/c)$: $\Omega(c \log p)$ samples are needed. (This bound is from [1].)

Remark 4. This means that when $\lambda = \Omega(\sqrt{p}/c)$, a huge number (exponential for constant $\lambda$) of samples is required. Hence, for any efficient algorithm, we require $\lambda = O(\sqrt{p}/c)$, and in this regime $O(c \log p)$ samples are required to learn.

6.1 Proof Outline

The proof skeleton is based on Lemma 2. The essence of the proof is to cover a set of graphs $T$ of large measure by an exponentially small set, such that the KL-divergence between any covered graph and its covering graph is also very small. For this we use Corollary 3. The key steps in the proof are outlined below:

1. We identify a subclass of graphs $T$, as in Lemma 2, whose measure is close to 1, i.e. $\mu(T) = 1 - o(1)$. A natural candidate is the 'typical' set $T^p_\epsilon$, defined to be the set of graphs whose number of edges lies in $\left(\frac{cp}{2} - cp\epsilon,\ \frac{cp}{2} + cp\epsilon\right)$.

2. (Path property) We show that most graphs in $T$ have property $R$: there are $O(p^2)$ pairs of nodes such that every pair is well connected by $O\left(\frac{c^2}{p}\right)$ node-disjoint paths of length 2 with high probability. The measure $\mu(R\,|\,T) = 1 - \delta_1$.

3. (Covering with low diameter) Every graph $G$ in $R \cap T$ is covered by a graph $G'$ from a covering set $C_R(\delta_2)$ such that their edge sets differ only in the $O(p^2)$ pairs of nodes that are well connected. Therefore, by Corollary 3, the KL-divergence between $G$ and $G'$ is very small ($\delta_2 = O(\lambda p^2 \cosh(\lambda)^{-c^2/p})$).

4. (Efficient covering in size) Further, the covering set $C_R$ is exponentially smaller than $T$.

5.
(Uncovered graphs have exponentially low measure) Then we show that the uncovered graphs have large KL-divergence ($O(p^2\lambda)$), but their measure $\mu(R^c\,|\,T)$ is exponentially small.

6. Using a similar (but more involved) expression for the probability of error as in Corollary 2, roughly we need $O\left(\frac{\log|T|}{\delta_1+\delta_2}\right)$ samples.

The above technique is very general. Potentially, it could be applied to other random graph classes.

7 Summary

In this paper, we have explored new approaches for computing sample complexity lower bounds for Ising models. By explicitly bringing out the dependence on the weights of the model, we have shown that unless the weights are restricted, the model may be hard to learn. For example, it is hard to learn a graph which has many paths between many pairs of vertices, unless $\lambda$ is controlled. For the random graph setting $G(p, c/p)$, while achievability is possible in the $c = \mathrm{poly}\log p$ case [1], we have shown lower bounds for $c > p^{0.75}$. Closing this gap remains a problem for future consideration. The application of our approaches to other deterministic or random graph classes, such as the Chung-Lu model [4] (a generalization of Erdős–Rényi graphs) or small-world graphs [18], would also be interesting.

Acknowledgments

R.T. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033. K.S. and A.D. acknowledge the support of NSF via CCF 1422549, 1344364, 1344179 and DARPA STTR and an ARO YIP award.

References

[1] Animashree Anandkumar, Vincent Y. F. Tan, Furong Huang, and Alan S. Willsky. High-dimensional structure estimation in Ising models: Local separation criterion. The Annals of Statistics, 40(3):1346–1375, 2012.

[2] Guy Bresler, Elchanan Mossel, and Allan Sly. Reconstruction of Markov random fields from samples: Some observations and algorithms.
In Proceedings of the 11th International Workshop, APPROX 2008, and 12th International Workshop, RANDOM 2008, on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques, pages 343–356. Springer-Verlag, 2008.

[3] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[4] Fan Chung and Linyuan Lu. Complex Graphs and Networks. American Mathematical Society, 2006.

[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.

[6] G. Cross and A. Jain. Markov random field texture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:25–39, 1983.

[7] Amir Dembo and Andrea Montanari. Ising models on locally tree-like graphs. The Annals of Applied Probability, 20(2):565–592, 2010.

[8] Abbas El Gamal and Young-Han Kim. Network Information Theory. Cambridge University Press, 2011.

[9] Ashish Goel, Michael Kapralov, and Sanjeev Khanna. Perfect matchings in O(n log n) time in regular bipartite graphs. SIAM Journal on Computing, 42(3):1392–1404, 2013.

[10] M. Hassner and J. Sklansky. Markov random field models of digitized image texture. In ICPR78, pages 538–540, 1978.

[11] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253–258, 1925.

[12] Stasys Jukna. Extremal Combinatorics, volume 2. Springer, 2001.

[13] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[14] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

[15] B. D. Ripley. Spatial Statistics.
Wiley, New York, 1981.

[16] Narayana P. Santhanam and Martin J. Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.

[17] R. Tandon and P. Ravikumar. On the difficulty of learning power law graphical models. In IEEE International Symposium on Information Theory (ISIT), 2013.

[18] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.

[19] J. W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846–850, 1978.

[20] Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564–1599, 1999.

[21] Yuchen Zhang, John Duchi, Michael Jordan, and Martin J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems 26, pages 2328–2336. Curran Associates, Inc., 2013.
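Editor's illustrative note (not part of the paper): the arithmetic behind the Section 5.3 remark, that restricting $\lambda$ to prevent exponential growth yields $n = \Omega(d^2 \log p)$, can be sanity-checked numerically. The sketch below assumes the Theorem 4 bound in the form $\max\{\log(pd/4)/(\lambda\tanh\lambda),\ e^{\lambda d}\log(pd/4)/(2\lambda d\, e^{\lambda})\}$ as reconstructed above; the function name `theorem4_terms` and the sample values of $p$, $d$ are our own hypothetical choices, not the paper's.

```python
import math

# Illustrative check (assumed/reconstructed form of the Theorem 4 bound, not
# the paper's code): with the edge weight restricted to lambda ~ 1/d, the
# first term behaves like d^2 * log(pd/4), matching the Omega(d^2 log p) rate.

def theorem4_terms(p, d, lam):
    """Return the two terms of the (reconstructed) Theorem 4 lower bound."""
    log_pd4 = math.log(p * d / 4)
    term1 = log_pd4 / (lam * math.tanh(lam))                      # log(pd/4) / (lam tanh lam)
    term2 = math.exp(lam * d) * log_pd4 / (2 * lam * d * math.exp(lam))
    return term1, term2

p, d = 10_000, 50            # hypothetical problem size
lam = 1.0 / d                # weight restriction that prevents exponential growth
t1, t2 = theorem4_terms(p, d, lam)

# For small lam, lam * tanh(lam) ~ lam^2 = 1/d^2, so term1 ~ d^2 * log(pd/4).
ratio = t1 / (d ** 2 * math.log(p * d / 4))
print(f"term1 = {t1:.0f}, term2 = {t2:.2f}, ratio to d^2*log(pd/4) = {ratio:.4f}")
```

With $\lambda = 1/d$ the printed ratio is close to 1, i.e. the first term is the $d^2 \log p$ bottleneck; taking $\lambda$ constant instead makes the $e^{\lambda d}$ term blow up, which is exactly the "exponential growth" the remark warns against.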