{"title": "Concentration of Multilinear Functions of the Ising Model with Applications to Network Data", "book": "Advances in Neural Information Processing Systems", "page_first": 12, "page_last": 23, "abstract": "We prove near-tight concentration of measure for polynomial functions of the Ising model, under high temperature, improving the radius of concentration guaranteed by known results by polynomial factors in the dimension (i.e. the number of nodes in the Ising model). We show that our results are optimal up to logarithmic factors in the dimension. We obtain our results by extending and strengthening the exchangeable-pairs approach used to prove concentration of measure in this setting by Chatterjee. We demonstrate the efficacy of such functions as statistics for testing the strength of interactions in social networks in both synthetic and real world data.", "full_text": "Concentration of Multilinear Functions of the Ising Model\nwith Applications to Network Data\n\nConstantinos Daskalakis \u2217\nEECS & CSAIL, MIT\n\ncostis@csail.mit.edu\n\nNishanth Dikkala\u2217\nEECS & CSAIL, MIT\n\nnishanthd@csail.mit.edu\n\nGautam Kamath\u2217\nEECS & CSAIL, MIT\ng@csail.mit.edu\n\nAbstract\n\nWe prove near-tight concentration of measure for polynomial functions of the\nIsing model under high temperature. For any degree d, we show that a degree-d\npolynomial of an n-spin Ising model exhibits exponential tails that scale as\nexp(\u2212r^{2/d}) at radius r = \u03a9\u0303_d(n^{d/2}). Our concentration radius is optimal up to\nlogarithmic factors for constant d, improving known results by polynomial factors\nin the number of spins.
We demonstrate the efficacy of polynomial functions as\nstatistics for testing the strength of interactions in social networks in both synthetic\nand real world data.\n\n1 Introduction\n\nThe Ising model is a fundamental probability distribution defined in terms of a graph G = (V, E)\nwhose nodes and edges are associated with scalar parameters (\u03b8_v)_{v\u2208V} and (\u03b8_{u,v})_{{u,v}\u2208E} respectively.\nThe distribution samples a vector x \u2208 {\u00b11}^V with probability:\n\np(x) = exp( \u2211_{v\u2208V} \u03b8_v x_v + \u2211_{(u,v)\u2208E} \u03b8_{u,v} x_u x_v \u2212 \u03a6(\u03b8\u20d7) ),    (1)\n\nwhere \u03a6(\u03b8\u20d7) serves to provide normalization. Roughly speaking, there is a random variable X_v at\nevery node of G, and this variable may be in one of two states, or spins: up (+1) or down (\u22121). The\nscalar parameter \u03b8_v models a local field at node v. The sign of \u03b8_v represents whether this local field\nfavors X_v taking the value +1, i.e. the up spin, when \u03b8_v > 0, or the value \u22121, i.e. the down spin,\nwhen \u03b8_v < 0, and its magnitude represents the strength of the local field. Similarly, \u03b8_{u,v} represents\nthe direct interaction between nodes u and v. Its sign represents whether it favors equal spins, when\n\u03b8_{u,v} > 0, or opposite spins, when \u03b8_{u,v} < 0, and its magnitude corresponds to the strength of the\ndirect interaction.
Of course, depending on the structure of G and the node and edge parameters, there\nmay be indirect interactions between nodes, which may overwhelm local fields or direct interactions.\nMany popular models, for example, the usual ferromagnetic Ising model [Isi25, Ons44], the\nSherrington-Kirkpatrick mean field model [SK75] of spin glasses, the Hopfield model [Hop82]\nof neural networks, and the Curie-Weiss model [DCG68], all belong to the above family of distributions,\nwith various special structures on G, the \u03b8_{u,v}\u2019s and the \u03b8_v\u2019s. Since its introduction in\nStatistical Physics, the Ising model has found a myriad of applications in diverse research disciplines,\nincluding probability theory, Markov chain Monte Carlo, computer vision, theoretical\ncomputer science, social network analysis, game theory, computational biology, and neuroscience;\nsee e.g. [LPW09, Cha05, Fel04, DMR11, GG86, Ell93, MS10] and their references. The ubiquity of\nthese applications motivates the problem of inferring Ising models from samples, or inferring statistical\nproperties of Ising models from samples. This type of problem has enjoyed much study in statistics,\nmachine learning, and information theory; see, e.g., [CL68, AKN06, CT06, Cha07, RWL10, JJR11,\nSW12, BGS14, Bre15, VMLC16, BK16, Bha16, BM16, MdCCU16, KM17, HKM17, DDK18].\n\n\u2217Authors are listed in alphabetical order.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nDespite the wealth of theoretical study and practical applications of this model, outlined above, there\nare still aspects of it that are poorly understood. In this work, we focus on the important topic of\nconcentration of measure. We are interested in studying the concentration properties of polynomial\nfunctions f(X) of the Ising model.
That is, for a random vector X sampled from p as above and a\npolynomial f, we are interested in the concentration of f(X) around its expectation E[f(X)]. Since\nthe coordinates of X take values in {\u00b11}, we can without loss of generality focus our attention on\nmulti-linear functions f.\nWhile the theory of concentration inequalities for functions of independent random variables has\nreached a high level of sophistication, proving concentration of measure for functions of dependent\nrandom variables is significantly harder, the main tools being martingale methods, logarithmic\nSobolev inequalities and transportation cost inequalities. One shortcoming of the latter methods is\nthat explicit constants are very hard or almost impossible to get. For the Ising model, in particular,\nthe log-Sobolev inequalities of Stroock and Zegarlinski [SZ92], known under high temperature,\u00b2\ndo not give explicit constants, and it is also not clear whether they extend to systems beyond the\nlattice. The high temperature regime is an interesting regime of \u2018weak\u2019 dependence where many\ndesirable properties related to Ising models hold. Perhaps the most important of them is that the\ncanonical Markov chain used to sample from these models, namely the Glauber dynamics, is fast\nmixing. Although the high-temperature regime allows only \u2018weak\u2019 pairwise correlations, it is still\nrich enough to encode interesting dependencies. For instance, in neuroscience, it has been seen that\nweak pairwise correlations can coexist with strong correlations in the state of the population as a\nwhole [SBSB06].\nAn alternative approach, proposed recently by Chatterjee [Cha05], is an adaptation to the Ising\nmodel of Stein\u2019s method of exchangeable pairs. This powerful method is well-known in probability\ntheory, and has been used to derive concentration inequalities with explicit constants for functions\nof dependent random variables (see [MJC+14] for a recent work).
Chatterjee uses this technique to\nestablish concentration inequalities for Lipschitz functions of the Ising model under high temperature.\nWhile these inequalities are tight (and provide Gaussian tails) for linear functions of the Ising model,\nthey are unfortunately not tight for higher degree polynomials, in that the concentration radius is\noff by factors that depend on the dimension n = |V|. For example, consider the function\nf_c(X) = \u2211_{i\u2260j} c_{ij} X_i X_j of an Ising model without external fields, where the c_{ij}\u2019s are signs. Chatterjee\u2019s\nresults imply that this function concentrates at radius \u00b1O(n^{1.5}), but as we show this is suboptimal by\na factor of \u03a9\u0303(\u221an).\nIn particular, our main technical contribution is to obtain near-tight concentration inequalities for\npolynomial functions of the Ising model, whose concentration radii are tight up to logarithmic factors.\nA corollary of our main result (Theorem 4) is as follows:\nTheorem 1. Consider any degree-d multilinear function f with coefficients in [\u22121, 1], defined on\nan Ising model p without external field in the high-temperature regime. Then there exists a constant\nC = C(d) > 0 (depending only on d) such that for any r = \u03a9\u0303_d(n^{d/2}), we have\n\nPr_{X\u223cp}[ |f(X) \u2212 E[f(X)]| > r ] \u2264 exp( \u2212C \u00b7 r^{2/d} / (n log n) ).\n\nThe concentration radius is tight up to logarithmic factors, and the tail bound is tight up to an\nO_d(1/ log n) factor in the exponent of the tail bound.\n\n\u00b2High temperature is a widely studied regime of the Ising model where it enjoys a number of useful properties\nsuch as decay of correlations and fast mixing of the Glauber dynamics. Throughout this paper we will take \u201chigh\ntemperature\u201d to mean that Dobrushin\u2019s conditions of weak dependence are satisfied.
See Definition 1.\n\nOur formal theorem statements for bilinear and higher degree multilinear functions appear as Theorems\n2 and 4 of Sections 3 and 4, respectively. Some further discussion of our results is in order:\n\n\u2022 Under existence of external fields, it is easy to see that the above concentration does not\nhold, even for bilinear functions. Motivated by our applications in Section 5 we extend the\nabove concentration of measure result to centered bilinear functions (where each variable\nX_i appears as X_i \u2212 E[X_i] in the function) that also holds under arbitrary external fields; see\nTheorem 3. We leave extensions of this result to higher degree multilinear functions to the\nnext version of this paper.\n\u2022 Moreover, notice that the tails for degree-2 functions are exponential and not Gaussian,\nand this is unavoidable, and that as the degree grows the tails become heavier exponentials,\nand this is also unavoidable. In particular, the tightness of our bound is justified in the\nsupplementary material.\n\u2022 Lastly, like Chatterjee and Stroock and Zegarlinski, we prove our results under high temperature.\nOn the other hand, it is easy to construct low temperature Ising models where no\nnon-trivial concentration holds.\u00b3\n\nWith our theoretical understanding in hand, we proceed with an experimental evaluation of the efficacy\nof multilinear functions applied to hypothesis testing. Specifically, given a binary vector, we attempt\nto determine whether or not it was generated by an Ising model. Our focus is on testing whether\nchoices in social networks can be approximated as an Ising model, a common and classical assumption\nin the social sciences [Ell93, MS10]. We apply our method to both synthetic and real-world data.\nOn synthetic data, we investigate when our statistics are successful in detecting departures from the\nIsing model.
For our real-world data study, we analyze the Last.fm dataset from HetRec\u201911 [CBK11].\nInterestingly, when considering musical preferences on a social network, we find that the Ising model\nmay be more or less appropriate depending on the genre of music.\n\n1.1 Related Work\n\nAs mentioned before, Chatterjee previously used the method of exchangeable pairs to prove variance\nand concentration bounds for linear statistics of the Ising model [Cha05]. In [DDK18], the authors\nprove variance bounds for bilinear statistics. The present work improves upon this by proving\nconcentration rather than bounding the variance, as well as considering general degrees d rather than\njust d = 2. In simultaneous work, Gheissari, Lubetzky, and Peres proved concentration bounds which\nare qualitatively similar to ours, though the techniques are somewhat different [GLP17].\n\n2 Preliminaries\n\nWe will state some preliminaries here; see the supplementary material for further preliminaries.\nWe define the high-temperature regime, also known as Dobrushin\u2019s uniqueness condition \u2013 in this\npaper, we will use the terms interchangeably.\nDefinition 1. Consider an Ising model p defined on a graph G = (V, E) with |V| = n and parameter\nvector \u03b8\u20d7. Suppose max_{v\u2208V} \u2211_{u\u2260v} tanh(|\u03b8_{uv}|) \u2264 1 \u2212 \u03b7 for some \u03b7 > 0. Then p is said to satisfy\nDobrushin\u2019s uniqueness condition, or be in the \u03b7-high temperature regime.\n\nIn some situations, we may use the parameter \u03b7 implicitly and simply say the Ising model is in the\nhigh temperature regime.\nGlauber dynamics refers to the canonical Markov chain for sampling from an Ising model; see the\nsupplementary material for a formal definition. Glauber dynamics define a reversible, ergodic Markov\nchain whose stationary distribution is identical to the corresponding Ising model.
In many relevant\nsettings, including the high-temperature regime, the dynamics are rapidly mixing and hence offer an\nefficient way to sample from Ising models. In particular, the mixing time in \u03b7-high-temperature is\nt_mix = (n log n)/\u03b7.\n\n\u00b3Consider an Ising model with no external fields, comprising two disjoint cliques of half the vertices with\ninfinitely strong bonds; i.e. \u03b8_v = 0 for all v, and \u03b8_{u,v} = \u221e if u and v belong to the same clique. Now consider\nthe multilinear function f(X) = \u2211_{u\u2241v} X_u X_v, where u \u2241 v denotes that u and v are not neighbors (i.e. belong\nto different cliques). It is easy to see that the maximum absolute value of f(X) is \u03a9(n^2) and that there is no\nconcentration at radius better than some \u03a9(n^2).\n\nWe may couple two executions of the Glauber dynamics using a greedy coupling (also known as a\nmonotone coupling). Roughly, this couples the choices made by the runs to maximize the probability\nof agreement; see the supplementary material for a formal definition. One of the key properties of\nthis coupling is that it satisfies the following contraction property:\nLemma 1. If p is an Ising model in \u03b7-high temperature, then the greedy coupling between two\nexecutions satisfies the following contraction in Hamming distance:\n\nE[ d_H(X_t^{(1)}, X_t^{(2)}) | (X_0^{(1)}, X_0^{(2)}) ] \u2264 (1 \u2212 \u03b7/n)^t d_H(X_0^{(1)}, X_0^{(2)}).\n\nThe key technical tool we use is the following concentration inequality for martingales:\nLemma 2 (Freedman\u2019s Inequality (Proposition 2.1 in [Fre75])). Let X_0, X_1, . . . , X_t be a sequence\nof martingale increments, such that S_i = \u2211_{j=0}^{i} X_j forms a martingale sequence. Let \u03c4 be a stopping\ntime and K \u2265 0 be such that Pr[|X_i| \u2264 K \u2200 i \u2264 \u03c4] = 1. Let v_i = Var[X_i | X_{i\u22121}] and V_t = \u2211_{i=0}^{t} v_i.\nThen Pr[|S_t| \u2265 r and V_t \u2264 b for some t \u2264 \u03c4] \u2264 2 exp( \u2212r^2 / (2(rK + b)) ).\n\n3 Concentration of Measure for Bilinear Functions\n\nIn this section, we describe our main concentration result for bilinear functions of the Ising model.\nThis is not as technically involved as the result for general-degree multilinear functions, but exposes\nmany of the main conceptual ideas. The theorem statement is as follows:\n\nTheorem 2. Consider any bilinear function f_a(x) = \u2211_{u,v} a_{uv} x_u x_v on an Ising model p (defined\non a graph G = (V, E) such that |V| = n) in \u03b7-high-temperature regime with no external field. Let\n\u2016a\u2016_\u221e = max_{u,v} a_{uv}. If X \u223c p, then for any r \u2265 300\u2016a\u2016_\u221e n log^2 n/\u03b7 + 2, we have\n\nPr[ |f_a(X) \u2212 E[f_a(X)]| \u2265 r ] \u2264 5 exp( \u2212\u03b7r / (1735 \u2016a\u2016_\u221e n log n) ).\n\nRemark 1. We note that \u03b7-high-temperature is not strictly needed for our results to hold \u2013 we only\nneed Hamming contraction of the \u201cgreedy coupling\u201d (see Lemma 1). This condition implies rapid\nmixing of the Glauber dynamics (in O(n log n) steps) via path coupling (Theorem 15.1 of [LPW09]).\n\n3.1 Overview of the Technique\n\nA well known approach to proving concentration inequalities for functions of dependent random\nvariables is via martingale tail bounds. For instance, Azuma\u2019s inequality gives useful tail bounds\nwhenever one can bound the martingale increments (i.e., the differences between consecutive terms\nof the martingale sequence) of the underlying martingale in absolute value, without requiring any\nform of independence.
Such an approach is fruitful in showing concentration of linear functions on\nthe Ising model in high temperature. The Glauber dynamics associated with Ising models in high\ntemperature are fast mixing and offer a natural way to define a martingale sequence. In particular,\nconsider the Doob martingale corresponding to any linear function f for which we wish to show\nconcentration, defined on the state of the dynamics at some time step t\u2217, i.e. f(X_{t\u2217}). If we choose\nt\u2217 larger than O(n log n) then f(X_{t\u2217}) would be very close to a sample from p irrespective of the\nstarting state. We set the first term of the martingale sequence as E[f(X_{t\u2217})|X_0] and the last term is\nsimply f(X_{t\u2217}). By bounding the martingale increments we can show that |f(X_{t\u2217}) \u2212 E[f(X_{t\u2217})|X_0]|\nconcentrates at the right radius with high probability. By making t\u2217 large enough we can argue that\nE[f(X_{t\u2217})|X_0] \u2248 E[f(X)]. Also, crucially, t\u2217 need not be too large since the dynamics are fast\nmixing. Hence we don\u2019t incur too big a hit when applying Azuma\u2019s inequality, and one can argue\nthat linear functions are concentrated with a radius of O\u0303(\u221an). Crucial to this argument is the fact\nthat linear functions are O(1)-Lipschitz (when the entries of a are constant), bounding the Doob\nmartingale differences to be O(1).\nThe challenge with bilinear functions is that they are O(n)-Lipschitz \u2013 a naive application of the\nsame approach gives a radius of concentration of O\u0303(n^{3/2}), which, albeit better than the trivial radius\nof O(n^2), is not optimal.
To show stronger concentration for bilinear functions, at a high level, the\nidea is to bootstrap the known fact that linear functions of the Ising model concentrate well at high\ntemperature.\nThe key insight is that, when we have a d-linear function, its Lipschitz constants are bounds on the\nabsolute values of certain (d \u2212 1)-linear functions. In particular, this implies that the Lipschitz constants\nof a bilinear function are bounds on the absolute values of certain associated linear functions. And\nalthough a worst case bound on the absolute value of linear functions with bounded coefficients\nwould be O(n), the fact that linear functions are concentrated within a radius of O\u0303(\u221an) means\nthat bilinear functions are O\u0303(\u221an)-Lipschitz in spirit. In order to exploit this intuition, we turn to\nmore sophisticated concentration inequalities, namely Freedman\u2019s inequality (Lemma 2). This is\na generalization of Azuma\u2019s inequality, which handles the case when the martingale differences\nare only bounded until some stopping time (very roughly, the first time we reach a state where the\nexpectation of the linear function after mixing is large). To apply Freedman\u2019s inequality, we would\nneed to define a stopping time which has two properties:\n\n1. The stopping time is larger than t\u2217 with high probability. Hence, with a good probability the\nprocess doesn\u2019t stop too early. The harm if the process stops too early (at t < t\u2217) is that we\nwill not be able to effectively decouple E[f_a(X_t)|X_0] from the choice of X_0. t\u2217 is chosen\nto be larger than the mixing time of the Glauber dynamics precisely because it allows us to\nargue that E[f_a(X_{t\u2217})|X_0] \u2248 E[f_a(X_{t\u2217})] = E[f_a(X)].\n\n2. For all times i + 1 less than the stopping time, the martingale increments are bounded, i.e.\n|B_{i+1} \u2212 B_i| = O(\u221an) where {B_i}_{i\u22650} is the martingale sequence.\n\nWe observe that the martingale increments corresponding to a martingale defined on a bilinear\nfunction have the flavor of the conditional expectations of certain linear functions which can be shown\nto concentrate at a radius O\u0303(\u221an) when the process starts at its stationary distribution. This provides\nus with a nice way of defining the stopping time to be the first time when one of these conditional\nexpectations deviates by more than \u03a9(\u221an poly log n) from the origin. More precisely, we define\na set G^a_K(t) of configurations x_t, which is parameterized by a function f_a(X) and parameter K\n(which we will take to be \u03a9\u0303(\u221an)). The objects of interest are linear functions f^v_a(X_{t\u2217}) conditioned\non X_t = x_t, where f^v_a are linear functions which arise when examining the evolution of f_a over\nsteps of the Glauber dynamics. G^a_K(t) is the set of configurations for which all such linear functions\nsatisfy certain conditions, including bounded expectation and concentration around their mean. The\nstopping time for our process T_K is defined as the first time we have a configuration which leaves\nthis set G^a_K(t). We can show that the stopping time is large via the following lemma:\nLemma 3. For any t \u2265 0, for t\u2217 = 3 t_mix,\n\nPr[ X_t \u2209 G^a_K(t) ] \u2264 8n exp( \u2212K^2 / (8t\u2217) ).\n\nNext, we require a bound on the conditional variance of the martingale increments. This can be\nshown using the property that the martingale increments are bounded up until the stopping time:\nLemma 4. Consider the Doob martingale where B_i = E[f_a(X_{t\u2217})|X_i]. Suppose X_i \u2208 G^a_K(i) and\nX_{i+1} \u2208 G^a_K(i + 1).
Then\n\n|B_{i+1} \u2212 B_i| \u2264 16K + 16n^2 exp( \u2212K^2 / (16t\u2217) ).\n\nWith these two pieces in hand, we can apply Freedman\u2019s inequality to bound the desired quantity.\nIt is worth noting that the martingale approach described above closely relates to the technique\nof exchangeable pairs exposited by Chatterjee [Cha05]. When we look at differences for the\nmartingale sequence defined using the Glauber dynamics, we end up analyzing an exchangeable\npair of the following form: sample X \u223c p from the Ising model. Take a step along the Glauber\ndynamics starting from X to reach X\u2032. (X, X\u2032) forms an exchangeable pair. This is precisely how\nChatterjee\u2019s application of exchangeable pairs is set up. Chatterjee then goes on to study a function\nof X and X\u2032 which serves as a proxy for the variance of f(X) and obtains concentration results\nby bounding the absolute value of this function. The definition of the function involves considering\ntwo greedily coupled runs of the Glauber dynamics just as we do in our martingale based approach.\n\nTo summarize, our proof of bilinear concentration involves showing various concentration properties\nfor linear functions via Azuma\u2019s inequality, showing that the martingale has O\u0303(\u221an)-bounded\ndifferences before our stopping time, proving that the stopping time is larger than the mixing time\nwith high probability, and combining these ingredients using Freedman\u2019s inequality. Full details are\nprovided in the supplementary material.\n\n3.2 Concentration Under an External Field\n\nUnder an external field, not all bilinear functions concentrate nicely even in the high temperature\nregime \u2013 in particular, they may concentrate with a radius of \u0398(n^{1.5}), instead of O(n). As such,\nwe must instead consider \u201crecentered\u201d statistics to obtain the same radius of concentration.
The\nfollowing theorem is proved in the supplementary material:\nTheorem 3.\n\n1. Bilinear functions on the Ising model of the form f_a(X) = \u2211_{u,v} a_{uv}(X_u \u2212 E[X_u])(X_v \u2212 E[X_v])\nsatisfy the following inequality at high temperature. There exist\nabsolute constants c and c\u2032 such that, for r \u2265 cn log^2 n/\u03b7,\n\nPr[ |f_a(X) \u2212 E[f_a(X)]| \u2265 r ] \u2264 4 exp( \u2212r / (c\u2032 n log n) ).\n\n2. Bilinear functions on the Ising model of the form f_a(X^{(1)}, X^{(2)}) = \u2211_{u,v} a_{uv}(X_u^{(1)} \u2212 X_u^{(2)})(X_v^{(1)} \u2212 X_v^{(2)}),\nwhere X^{(1)}, X^{(2)} are two i.i.d. samples from the Ising model, satisfy\nthe following inequality at high temperature. There exist absolute constants c and c\u2032 such\nthat, for r \u2265 cn log^2 n/\u03b7,\n\nPr[ |f_a(X^{(1)}, X^{(2)}) \u2212 E[f_a(X^{(1)}, X^{(2)})]| \u2265 r ] \u2264 4 exp( \u2212r / (c\u2032 n log n) ).\n\n4 Concentration of Measure for d-linear Functions\n\nMore generally, we can show concentration of measure for d-linear functions on an Ising model in\nhigh temperature, when d \u2265 3. Again, we will focus on the setting with no external field. Although\nwe will follow a recipe similar to that used for bilinear functions, the proof is more involved and\nrequires some new definitions and tools. The proof will proceed by induction on the degree d. Due to\nthe proof being more involved, for ease of exposition, we present the proof of Theorem 4 without\nexplicit values for constants.\nOur main theorem statement is the following:\n\nTheorem 4. Consider any degree-d multilinear function f_a(x) = \u2211_{U\u2286V:|U|=d} a_U \u220f_{u\u2208U} x_u on an\nIsing model p (defined on a graph G = (V, E) such that |V| = n) in \u03b7-high-temperature regime\nwith no external field. Let \u2016a\u2016_\u221e = max_{U\u2286V:|U|=d} |a_U|. There exist constants C_1 = C_1(d) > 0 and\nC_2 = C_2(d) > 0 depending only on d, such that if X \u223c p, then for any r \u2265 C_1 \u2016a\u2016_\u221e (n log^2 n/\u03b7)^{d/2},\nwe have\n\nPr[ |f_a(X) \u2212 E[f_a(X)]| > r ] \u2264 2 exp( \u2212\u03b7 r^{2/d} / (C_2 \u2016a\u2016_\u221e^{2/d} n log n) ).\n\nSimilar to Remark 1, our theorem statement still holds under the weaker assumption of Hamming\ncontraction. This bound is also tight up to polylogarithmic factors in the radius of concentration and\nthe exponent of the tail bound; see Remark 1 in the supplementary material.\n\n4.1 Overview of the Technique\n\nOur approach uses induction and is similar to the one used for bilinear functions. To show concentration\nfor d-linear functions we will use the concentration of (d \u2212 1)-linear functions together with\nFreedman\u2019s martingale inequality.\nConsider the following process: Sample X_0 \u223c p from the Ising model of interest. Starting at X_0,\nrun the Glauber dynamics associated with p for t\u2217 = (d + 1) t_mix steps. We will study the target\nquantity, Pr[ |f_a(X_{t\u2217}) \u2212 E[f_a(X_{t\u2217})|X_0]| > K ], by defining a martingale sequence similar to the one\nin the bilinear proof. However, to bound the increments of the martingale for d-linear functions we\nwill require an induction hypothesis which is more involved. The reason is that with higher degree\nmultilinear functions (d > 2), the argument for bounding increments of the martingale sequence runs\ninto multilinear terms which are a function of not just a single instance of the dynamics X_t, but also\nof the configuration obtained from the coupled run, X\u2032_t. We call such multilinear terms hybrid terms\nand multilinear functions involving hybrid terms hybrid multilinear functions henceforth.
Since the\ntwo runs (of the Glauber dynamics) are coupled greedily to maximize the probability of agreement\nand they start with a small Hamming distance from each other (\u2264 1), these hybrid terms behave very\nsimilarly to the non-hybrid multilinear terms. Showing that their behavior is similar, however, requires\nsome supplementary statements about them which are presented in the supplementary material.\nIn addition to the martingale technique of Section 3, an ingredient that is crucial to proving\nconcentration for d \u2265 3 is a bound on the magnitude of the (d \u2212 1)-order marginals of the Ising\nmodel:\nLemma 5. Consider any Ising model p at high temperature. Let d be a positive integer. We have\n\n| \u2211_{u_1,...,u_d} E_p[X_{u_1} X_{u_2} \u00b7\u00b7\u00b7 X_{u_d}] | \u2264 2 (4nd log n / \u03b7)^{d/2}.\n\nThis is because when studying degree d \u2265 3 functions we find ourselves having to bound expected\nvalues of degree d \u2212 1 multilinear functions on the Ising model. A naive bound of O_d(n^{d\u22121}) can\nbe argued for these functions but by exploiting the fact that we are in high temperature, we can\nshow a bound of O_d(n^{(d\u22121)/2}) via a coupling with the Fortuin-Kasteleyn model. When d = 2,\n(d \u2212 1)-linear functions are just linear functions which are zero mean. However, for d \u2265 3, this is not\nthe case. Hence, we first need to prove this desired bound on the marginals of an Ising model in high\ntemperature.\nFurther details are provided in the supplementary material.\n\n5 Experiments\n\nIn this section, we apply our family of bilinear statistics on the Ising model to a problem of statistical\nhypothesis testing. Given a single sample from a multivariate distribution, we attempt to determine\nwhether or not this sample was generated from an Ising model in the high-temperature regime.
More\nspecifically, the null hypothesis is that the sample is drawn from an Ising model with a known graph\nstructure with a common edge parameter and a uniform node parameter (which may potentially be\nknown to be 0). In Section 5.1, we apply our statistics to synthetic data. In Section 5.2, we turn our\nattention to the Last.fm dataset from HetRec 2011 [CBK11].\nThe running theme of our experimental investigation is testing the classical and common assumption\nwhich models choices in social networks as an Ising model [Ell93, MS10]. To be more concrete,\nchoices in a network could include whether to buy an iPhone or an Android phone, or whether\nto vote for a Republican or Democratic candidate. Such choices are naturally influenced by one\u2019s\nneighbors in the network \u2013 one may be more likely to buy an iPhone if he sees all his friends have\none, corresponding to an Ising model with positive-weight edges.\u2074 In our synthetic data study, we will\nleave these choices as abstract, referring to them only as \u201cvalues,\u201d but in our Last.fm data study, these\nchoices will be whether or not one listens to a particular artist.\nOur general algorithmic approach is as follows. Given a single multivariate sample, we first run the\nmaximum pseudo-likelihood estimator (MPLE) to obtain an estimate of the model\u2019s parameters. The\nMPLE is a canonical estimator for the parameters of the Ising model, and it enjoys strong consistency\nguarantees in many settings of interest [Cha07, BM16]. If the MPLE gives a large estimate of the\nmodel\u2019s edge parameter, this is sufficient evidence to reject the null hypothesis. Otherwise, we use\nMarkov Chain Monte Carlo (MCMC) on a model with the MPLE parameters to determine a range\nof values for our statistic.
We note that, to be precise, we would need to quantify the error incurred\nby the MPLE \u2013 in favor of simplicity in our exploratory investigation, we eschew this detail, and\n\n4Note that one may also decide against buying an iPhone in this scenario, if one places high value on\n\nindividuality and uniqueness \u2013 this corresponds to negative-weight edges.\n\n7\n\n\fat this point attempt to reject the null hypothesis of the model learned by the MPLE. Our statistic\nis bilinear in the Ising model, and thus enjoys the strong concentration properties explained earlier\nin this paper. Note that since the Ising model will be in the high-temperature regime, the Glauber\ndynamics mix rapidly, and we can ef\ufb01ciently sample from the model using MCMC. Finally, given the\nrange of values for the statistic determined by MCMC, we reject the null hypothesis if p \u2264 0.05.\n\n5.1 Synthetic Data\n\n2)\n\nu=(i,j)\n\nWe proceed with our investigation on synthetic data. Our null hypothesis is that the sample is\ngenerated from an Ising model in the high temperature regime on the grid, with no external \ufb01eld (i.e.\n\u03b8u = 0 for all u) and a common (unknown) edge parameter \u03b8 (i.e., \u03b8uv = \u03b8 iff nodes u and v are\nadjacent in the grid, and 0 otherwise). For the Ising model on the grid, the critical edge parameter for\n\u221a\n. In other words, we are in high-temperature if and only if \u03b8 \u2264 \u03b8c,\nhigh-temperature is \u03b8c = ln(1+\n2\nand we can reject the null hypothesis if the MPLE estimate \u02c6\u03b8 > \u03b8c.\nTo generate departures from the null hypothesis, we give a construction parameterized by \u03c4 \u2208 [0, 1].\nWe provide a rough description of the departures, for a precise description, see the supplemental\nmaterial. Each node x selects a random node y at Manhattan distance at most 2, and sets y\u2019s value\nto x with probability \u03c4. 
The intuition behind this construction is that each individual selects a\nfriend or a friend-of-a-friend, and tries to convince them to take his value \u2013 he is successful with\nprobability \u03c4. Selecting either a friend or a friend-of-a-friend is in line with the concept of strong\ntriadic closure [EK10] from the social sciences, which suggests that two individuals with a mutual\nfriend are likely to either already be friends (which the social network may not have knowledge of) or\nbecome friends in the future.\nAn example of a sample generated from this distribution with \u03c4 = 0.04 is provided in Figure 1 of the\nsupplementary material, alongside a sample from the Ising model generated with the corresponding\nMPLE parameters. We consider this distribution to pass the \u201ceye test\u201d \u2013 one cannot easily distinguish\nthese two distributions by simply glancing at them. However, as we will see, our multilinear statistic\nis able to correctly reject the null a large fraction of the time.\nOur experimental process was as follows. We started with a 40 \u00d7 40 grid, corresponding to a\ndistribution with n = 1600 dimensions. We generated values for this grid according to the departures\nfrom the null described above, with some parameter \u03c4. We then ran the MPLE to obtain an\nestimate \u02c6\u03b8 for the edge parameter, immediately rejecting the null if \u02c6\u03b8 > \u03b8c. Otherwise, we ran the\nGlauber dynamics for O(n log n) steps to generate a sample from the grid Ising model with parameter\n\u02c6\u03b8. We repeated this process to generate 100 samples, and for each sample, computed the value of\nthe statistic Zlocal = \u2211_{u=(i,j)} \u2211_{v=(k,l): d(u,v) \u2264 2} XuXv, where d(\u00b7,\u00b7) is the Manhattan distance on\nthe grid. This statistic can be justified since we wish to account for the possibility of connections\nbetween friends-of-friends of which the social network may be lacking knowledge. We then compare these values\nwith the value of the statistic Zlocal on the provided sample, and reject the null hypothesis if this\nstatistic corresponds to a p-value of at most 0.05. We repeat this for a wide range of values of \u03c4 \u2208 [0, 1],\nand repeat 500 times for each \u03c4.\nOur results are displayed in Figure 1. The x-axis marks the value of parameter \u03c4, and the y-axis\nindicates the fraction of repetitions in which we successfully rejected the null hypothesis. The\nperformance of the MPLE alone is indicated by the orange line, while the performance of our statistic\nis indicated by the blue line. We find that our statistic is able to correctly reject the null at a much\nearlier point than the MPLE alone. In particular, our statistic manages to reject the null for \u03c4 \u2265 0.04,\nwhile the MPLE requires a parameter that is an order of magnitude larger, at 0.4. As mentioned\nbefore, in the former regime (when \u03c4 \u2248 0.04), it appears impossible to distinguish the distribution\nfrom a sample from the Ising model with the naked eye.\n\nFigure 1: Power of our statistic on synthetic data.\n\n5.2 Last.fm Dataset\n\nWe now turn our focus to the Last.fm dataset from HetRec\u201911 [CBK11]. This dataset consists of\ndata from n = 1892 users on the Last.fm online music system. On Last.fm, users can indicate\n(bi-directional) friend relationships, thus constructing a social network \u2013 our dataset has m = 12717\nsuch edges. The dataset also contains users\u2019 listening habits \u2013 for each user we have a list of their\nfifty favorite artists, whose tracks they have listened to the most times. We wish to test whether users\u2019\npreference for a particular artist is distributed according to a high-temperature Ising model.\nFixing some artist a of interest, we consider the vector X^(a), where X^(a)_u is +1 if user u has artist\na in his favorite artists, and \u22121 otherwise. We wish to test the null hypothesis that X^(a) is\ndistributed according to an Ising model in the high-temperature regime on the known social network\ngraph, with common (unknown) external field h (i.e., \u03b8u = h for all u) and edge parameter \u03b8 (i.e.,\n\u03b8uv = \u03b8 iff u and v are neighbors in the graph, and 0 otherwise).\nOur overall experimental process was very similar to the synthetic data case. We gathered a list of the\nten most-common favorite artists, and repeated the following process for each artist a. We consider\nthe vector X^(a) (defined above) and run the MPLE on it, obtaining estimates \u02c6h and \u02c6\u03b8. We\nthen run MCMC to generate 100 samples from the Ising model with these parameters, and for each\nsample, compute the value of the statistics Zk = \u2211_u \u2211_{v: d(u,v) \u2264 k} (Xu \u2212 tanh(\u02c6h))(Xv \u2212 tanh(\u02c6h)),\nwhere d(\u00b7,\u00b7) is the distance on the graph, and k = 1 (the neighbor correlation statistic) or 2 (the\nlocal correlation statistic). Motivated by our theoretical results (Theorem 3), we consider a statistic\nwhere the variables are recentered by their marginal expectations, as this statistic experiences sharper\nconcentration. We again consider k = 2 to account for the possibility of edges which are unknown to\nthe social network.\nStrikingly, we found that the plausibility of the Ising modelling assumption varies significantly\ndepending on the artist. We highlight some of our more interesting findings here; see the supplemental\nmaterial for more details. The most popular artist in the dataset was Lady Gaga, who was a favorite\nartist of 611 users. We found that X^(Lady Gaga) had statistics Z1 = 9017.3 and\nZ2 = 106540.
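For readers who want to see how statistics of this kind could be computed, the following is a hedged sketch of the recentered statistic Zk on a general graph, together with one possible way to turn simulated values into a p-value. Several choices here are assumptions rather than details from the paper: the graph is given as adjacency lists, the pair u = v is excluded from the sum (including it only changes how each node's own term is counted), each unordered pair is counted twice as in the double sum, and the two-sided empirical p-value with an add-one correction is merely one common convention. The names within_distance, z_k and empirical_p_value are hypothetical.

```python
import math
from collections import deque

def within_distance(adj, source, k):
    # Nodes other than source at graph distance at most k (breadth-first search).
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond distance k
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    del dist[source]
    return list(dist)

def z_k(x, adj, k, h_hat=0.0):
    # Sum over ordered pairs u != v with d(u, v) <= k of
    # (x_u - tanh(h_hat)) * (x_v - tanh(h_hat)).
    mu = math.tanh(h_hat)
    total = 0.0
    for u in range(len(x)):
        for v in within_distance(adj, u, k):
            total += (x[u] - mu) * (x[v] - mu)
    return total

def empirical_p_value(observed, simulated):
    # Two-sided empirical p-value with an add-one correction
    # (an assumed convention, not taken from the paper).
    n = len(simulated)
    below = sum(1 for z in simulated if z <= observed)
    above = sum(1 for z in simulated if z >= observed)
    return min(1.0, 2.0 * (1 + min(below, above)) / (n + 1))
```

Here simulated would hold the values of z_k on the MCMC samples and observed the value on the real data; with 100 simulated values, the smallest p-value this convention can report is 2/101.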
The range of these statistics computed by MCMC can be seen in Figure 2 of the\nsupplementary material \u2013 clearly, the computed statistics fall far outside these ranges, and we can\nreject the null hypothesis with p \u226a 0.01. Similar results held for other popular pop musicians,\nincluding Britney Spears, Christina Aguilera, Rihanna, and Katy Perry.\nHowever, we observed qualitatively different results for The Beatles, the fourth most popular artist\nand a favorite of 480 users. We found that X^(The Beatles) had statistics Z1 = 2157.8 and Z2 =\n22196. The range of these statistics computed by MCMC can be seen in Figure 3 of the supplementary\nmaterial. This time, the computed statistics fall near the center of this range, and we cannot reject\nthe null. Similar results held for the rock band Muse.\nBased on our investigation, our statistic seems to indicate that for the pop artists, the null fails to\neffectively model the distribution, while it performs much better for the rock artists. We conjecture\nthat this may be due to the highly divisive popularity of pop artists like Lady Gaga and Britney Spears\n\u2013 while some users may love these artists (and may form dense cliques within the graph), others have\nlittle to no interest in their music. The null would have to be expanded to accommodate heterogeneity\nto model such effects. On the other hand, rock bands like The Beatles and Muse seem to be much\nmore uniform in their appeal: users seem to be much more homogeneous when it comes to preference\nfor these groups.\n\n[Figure 1 plot: probability of rejecting the null with local correlations and MPLE. x-axis: model parameter value (log scale, 10^-3 to 10^0); y-axis: test probability of success (0 to 1); series: local correlation statistic, MPLE.]\n\nAcknowledgments\n\nResearch was supported by NSF CCF-1617730, CCF-1650733, and ONR N00014-12-1-0999.
Part\nof this work was done while GK was an intern at Microsoft Research New England.\n\nReferences\n\n[AKN06] Pieter Abbeel, Daphne Koller, and Andrew Y. Ng. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7(Aug):1743\u20131788, 2006.\n\n[BGS14] Guy Bresler, David Gamarnik, and Devavrat Shah. Structure learning of antiferromagnetic Ising models. In Advances in Neural Information Processing Systems 27, NIPS \u201914, pages 2852\u20132860. Curran Associates, Inc., 2014.\n\n[Bha16] Bhaswar B. Bhattacharya. Power of graph-based two-sample tests. arXiv preprint arXiv:1508.07530, 2016.\n\n[BK16] Guy Bresler and Mina Karzand. Learning a tree-structured Ising model in order to make predictions. arXiv preprint arXiv:1604.06749, 2016.\n\n[BM16] Bhaswar B. Bhattacharya and Sumit Mukherjee. Inference in Ising models. Bernoulli, 2016.\n\n[Bre15] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the 47th Annual ACM Symposium on the Theory of Computing, STOC \u201915, pages 771\u2013782, New York, NY, USA, 2015. ACM.\n\n[CBK11] Iv\u00e1n Cantador, Peter Brusilovsky, and Tsvi Kuflik. Second workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys \u201911, pages 387\u2013388, New York, NY, USA, 2011. ACM.\n\n[Cha05] Sourav Chatterjee. Concentration Inequalities with Exchangeable Pairs. PhD thesis, Stanford University, June 2005.\n\n[Cha07] Sourav Chatterjee. Estimation in spin glasses: A first step. The Annals of Statistics, 35(5):1931\u20131946, October 2007.\n\n[CL68] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462\u2013467, 1968.\n\n[CT06] Imre Csisz\u00e1r and Zsolt Talata.
Consistent estimation of the basic neighborhood of Markov random fields. The Annals of Statistics, 34(1):123\u2013145, 2006.\n\n[DCG68] Stanley Deser, Max Chr\u00e9tien, and Eugene Gross. Statistical Physics, Phase Transitions, and Superfluidity. Gordon and Breach, 1968.\n\n[DDK18] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA \u201918, Philadelphia, PA, USA, 2018. SIAM.\n\n[DMR11] Constantinos Daskalakis, Elchanan Mossel, and S\u00e9bastien Roch. Evolutionary trees and the Ising model on the Bethe lattice: A proof of Steel\u2019s conjecture. Probability Theory and Related Fields, 149(1):149\u2013189, 2011.\n\n[EK10] David Easley and Jon Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, 2010.\n\n[Ell93] Glenn Ellison. Learning, local interaction, and coordination. Econometrica, 61(5):1047\u20131071, 1993.\n\n[Fel04] Joseph Felsenstein. Inferring Phylogenies. Sinauer Associates Sunderland, 2004.\n\n[Fre75] David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100\u2013118, 1975.\n\n[GG86] Stuart Geman and Christine Graffigne. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, pages 1496\u20131517. American Mathematical Society, 1986.\n\n[GLP17] Reza Gheissari, Eyal Lubetzky, and Yuval Peres. Concentration inequalities for polynomials of contracting Ising models. arXiv preprint arXiv:1706.00121, 2017.\n\n[HKM17] Linus Hamilton, Frederic Koehler, and Ankur Moitra. Information theoretic properties of Markov random fields, and their algorithmic applications. In Advances in Neural Information Processing Systems 30, NIPS \u201917. Curran Associates, Inc., 2017.\n\n[Hop82] John J. Hopfield.
Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554\u20132558, 1982.\n\n[Isi25] Ernst Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift f\u00fcr Physik A Hadrons and Nuclei, 31(1):253\u2013258, 1925.\n\n[JJR11] Ali Jalali, Christopher C. Johnson, and Pradeep K. Ravikumar. On learning discrete graphical models using greedy methods. In Advances in Neural Information Processing Systems 24, NIPS \u201911, pages 1935\u20131943. Curran Associates, Inc., 2011.\n\n[KM17] Adam Klivans and Raghu Meka. Learning graphical models using multiplicative weights. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science, FOCS \u201917, Washington, DC, USA, 2017. IEEE Computer Society.\n\n[LPW09] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.\n\n[MdCCU16] Abraham Mart\u00edn del Campo, Sarah Cepeda, and Caroline Uhler. Exact goodness-of-fit testing for the Ising model. Scandinavian Journal of Statistics, 2016.\n\n[MJC+14] Lester Mackey, Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs. The Annals of Probability, 42(3):906\u2013945, 2014.\n\n[MS10] Andrea Montanari and Amin Saberi. The spread of innovations in social networks. Proceedings of the National Academy of Sciences, 107(47):20196\u201320201, 2010.\n\n[Ons44] Lars Onsager. Crystal statistics. I. A two-dimensional model with an order-disorder transition. Physical Review, 65(3\u20134):117, 1944.\n\n[RWL10] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using \u21131-regularized logistic regression. The Annals of Statistics, 38(3):1287\u20131319, 2010.\n\n[SBSB06] Elad Schneidman, Michael J. Berry, Ronen Segev, and William Bialek.
Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007\u20131012, 2006.\n\n[SK75] David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35(26):1792, 1975.\n\n[SW12] Narayana P. Santhanam and Martin J. Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117\u20134134, 2012.\n\n[SZ92] Daniel W. Stroock and Boguslaw Zegarlinski. The logarithmic Sobolev inequality for discrete spin systems on a lattice. Communications in Mathematical Physics, 149(1):175\u2013193, 1992.\n\n[VMLC16] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems 29, NIPS \u201916, pages 2595\u20132603. Curran Associates, Inc., 2016.\n", "award": [], "sourceid": 26, "authors": [{"given_name": "Constantinos", "family_name": "Daskalakis", "institution": "MIT"}, {"given_name": "Nishanth", "family_name": "Dikkala", "institution": "MIT"}, {"given_name": "Gautam", "family_name": "Kamath", "institution": "MIT"}]}