{"title": "Interaction Screening: Efficient and Sample-Optimal Learning of Ising Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2595, "page_last": 2603, "abstract": "We consider the problem of learning the underlying graph of an unknown Ising model on p spins from a collection of i.i.d. samples generated from the model. We suggest a new estimator that is computationally efficient and requires a number of samples that is near-optimal with respect to previously established information theoretic lower-bound. Our statistical estimator has a physical interpretation in terms of \"interaction screening\". The estimator is consistent and is efficiently implemented using convex optimization. We prove that with appropriate regularization, the estimator recovers the underlying graph using a number of samples that is logarithmic in the system size p and exponential in the maximum coupling-intensity and maximum node-degree.", "full_text": "Interaction Screening: Ef\ufb01cient and Sample-Optimal\n\nLearning of Ising Models\n\nMarc Vuffray1, Sidhant Misra2, Andrey Y. Lokhov1,3, and Michael Chertkov1,3,4\n\n1Theoretical Division T-4, Los Alamos National Laboratory, Los Alamos, NM 87545, USA\n2Theoretical Division T-5, Los Alamos National Laboratory, Los Alamos, NM 87545, USA\n\n3Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA\n\n4Skolkovo Institute of Science and Technology, 143026 Moscow, Russia\n\n{vuffray, sidhant, lokhov, chertkov}@lanl.gov\n\nAbstract\n\nWe consider the problem of learning the underlying graph of an unknown Ising\nmodel on p spins from a collection of i.i.d. samples generated from the model. We\nsuggest a new estimator that is computationally ef\ufb01cient and requires a number of\nsamples that is near-optimal with respect to previously established information-\ntheoretic lower-bound. 
Our statistical estimator has a physical interpretation in terms of "interaction screening". The estimator is consistent and is efficiently implemented using convex optimization. We prove that with appropriate regularization, the estimator recovers the underlying graph using a number of samples that is logarithmic in the system size p and exponential in the maximum coupling intensity and maximum node degree.

1 Introduction

A Graphical Model (GM) describes a probability distribution over a set of random variables that factorizes over the edges of a graph. It is of interest to recover the structure of GMs from random samples. The graphical structure contains valuable information on the dependencies between the random variables; in fact, the neighborhood of a random variable is the minimal set that provides maximum information about this variable. Unsurprisingly, GM reconstruction plays an important role in various fields such as the study of gene expression [1], protein interactions [2], neuroscience [3], image processing [4], sociology [5] and even grid science [6, 7].

The origin of the GM reconstruction problem traces back to the seminal 1968 paper by Chow and Liu [8], where the problem was posed and resolved for the special case of tree-structured GMs. In this special tree case the maximum likelihood estimator is tractable and amounts to finding a maximum-weight spanning tree. However, it is also known that in the case of general graphs with cycles, maximum likelihood estimators are intractable, as they require computation of the partition function of the underlying GM, with notable exceptions of the Gaussian GM, see for instance [9], and some other special cases, like planar Ising models without magnetic field [10].

Much of the effort in this field has focused on learning Ising models, which are the most general GMs over binary variables with pairwise interaction/factorization.
Early attempts to learn the Ising model structure efficiently were heuristic, based on various mean-field approximations, e.g. utilizing empirical correlation matrices [11, 12, 13, 14]. These methods were satisfactory in cases where correlations decrease with graph distance. However, it was also noticed that the mean-field methods perform poorly for Ising models with long-range correlations. This observation is not surprising in light of recent results stating that learning the structure of Ising models using only their correlation matrix is, in general, computationally intractable [15, 16].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Among methods that do not rely solely on correlation matrices but take advantage of higher-order correlations that can be estimated from samples, we mention the approach based on sparsistency of the so-called regularized pseudo-likelihood estimator [17]. This estimator, like the one we propose in this paper, is from the class of M-estimators, i.e. estimators that are the minimum of a sum of functions over the sampled data [22]. The regularized pseudo-likelihood estimator is regarded as a surrogate for the intractable likelihood estimator, with an additive ℓ1-norm penalty to encourage sparsity of the reconstructed graph. The sparsistency-based estimator offers guarantees for structure reconstruction, but the result applies only to GMs that satisfy a certain condition that is rather restrictive and hard to verify. It was also proven that the sparsity pattern of the regularized pseudo-likelihood estimator fails to reconstruct the structure of graphs with long-range correlations, even for simple test cases [18].

Principal tractability of structure reconstruction of an arbitrary Ising model from samples was proven only very recently.
Bresler, Mossel and Sly in [19] suggested an algorithm which reconstructs the graph without errors in polynomial time, and showed that the algorithm requires a number of samples that is logarithmic in the number of variables. Although this algorithm is of polynomial complexity, it relies on an exhaustive neighborhood search, and the degree of the polynomial is equal to the maximal node degree.

Prior to the work reported in this manuscript, the best known procedure for perfect reconstruction of an Ising model was a greedy algorithm proposed by Bresler in [20]. Bresler's algorithm is based on the observation that the mutual information between neighboring nodes in an Ising model is lower bounded. This observation allows one to reconstruct the Ising graph perfectly with only a logarithmic number of samples and in time quasi-quadratic in the number of variables. On the other hand, Bresler's algorithm suffers from two major practical limitations. First, the number of samples, and hence the running time as well, scales doubly exponentially with the largest node degree and with the largest coupling intensity between pairs of variables. This scaling is rather far from the information-theoretic lower bound reported in [21], which instead predicts a single-exponential dependence on the two aforementioned quantities. Second, Bresler's algorithm requires prior information on the maximum and minimum coupling intensities as well as on the maximum node degree, guarantees which, in reality, are not necessarily available.

In this paper we propose a novel estimator for the graph structure of an arbitrary Ising model which achieves perfect reconstruction in quasi-quartic time (although we believe it can provably be reduced to quasi-quadratic time) and with a number of samples logarithmic in the system size.
The algorithm is near-optimal in the sense that the number of samples required to achieve perfect reconstruction, and the run time, scale exponentially with respect to the maximum node degree and the maximum coupling intensity, thus matching parametrically the information-theoretic lower bound of [21]. Our statistical estimator has the structure of a consistent M-estimator implemented via convex optimization with an additional thresholding procedure. Moreover, it allows an intuitive interpretation in terms of what we coin "interaction screening". We show that with a proper ℓ1-regularization our estimator reconstructs the couplings of an Ising model from a number of samples that is near-optimal. In addition, our estimator does not rely on prior information on the model characteristics, such as the maximum coupling intensity and the maximum degree.

The rest of the paper is organized as follows. In Section 2 we give a precise definition of the structure-estimation problem for Ising models and describe in detail our method for structure reconstruction within the family of Ising models. The main results related to the reconstruction guarantees are provided by Theorem 1 and Theorem 2. In Section 3 we explain the strategy and the sequence of steps that we use to prove our main theorems; proofs of Theorem 1 and Theorem 2 are summarized at the end of that section. Section 4 illustrates the performance of our reconstruction algorithm via simulations. Here we show on a number of test cases that the sample complexity of the suggested method scales logarithmically with the number of variables and exponentially with the maximum coupling intensity. In Section 5 we discuss possible generalizations of the algorithm and future work.

2 Main Results

Consider a graph G = (V, E) with p vertices, where V = {1, . . . , p} is the vertex set and E ⊂ V × V is the undirected edge set.
Vertices i ∈ V are associated with binary random variables σ_i ∈ {−1, +1} that are called spins. Edges (i, j) ∈ E are associated with non-zero real parameters θ*_ij ≠ 0 that are called couplings. An Ising model is a probability distribution μ over spin configurations σ = {σ_1, . . . , σ_p} that reads as follows:

    μ(σ) = (1/Z) exp( Σ_{(i,j)∈E} θ*_ij σ_i σ_j ),    (1)

    Z = Σ_σ exp( Σ_{(i,j)∈E} θ*_ij σ_i σ_j ),    (2)

where Z is a normalization factor called the partition function.

Notice that even though the main innovation of this paper, the efficient "interaction screening" estimator, can be constructed for the most general Ising models, we restrict our attention in this paper to the special case of Ising models with zero local magnetic field. This simplification is not necessary and is done solely to simplify the (generally rather bulky) algebra. Later in the text we thus refer to the zero-magnetic-field model (1) simply as the Ising model.

2.1 Structure-Learning of Ising Models

Suppose that n sequences/samples of p spins {σ^(k)}_{k=1,...,n} are observed, where each observed spin configuration σ^(k) = {σ_1^(k), . . . , σ_p^(k)} is i.i.d. from (1). Based on these measurements/samples we aim to construct an estimator Ê of the edge set that reconstructs the structure exactly with high probability, i.e.

    P[Ê = E] = 1 − ε,    (3)

where ε ∈ (0, 1/2) is a prescribed reconstruction error.

We are interested in learning structures of Ising models in the high-dimensional regime where the number of observations/samples is of the order n = O(ln p). A necessary condition on the number of samples is given in [21, Thm. 1]. This condition depends explicitly on the smallest and largest coupling intensities,

    α := min_{(i,j)∈E} |θ*_ij|,    β := max_{(i,j)∈E} |θ*_ij|,    (4)

and on the maximal node degree,

    d := max_{i∈V} |∂i|,    (5)

where the set of neighbors of a node i ∈ V is denoted by ∂i := {j | (i, j) ∈ E}.

According to [21], in order to reconstruct the structure of an Ising model with minimum coupling intensity α, maximum coupling intensity β, and maximum degree d, the required number of samples should be at least

    n ≥ max( e^{βd} ln(pd/4 − 1) / (4dα e^α),  ln p / (2α tanh α) ).    (6)

We see from Eq. (6) that the exponential dependence on the degree and on the maximum coupling intensity is unavoidable. Moreover, when the minimal coupling is small, the number of samples should scale at least as α^{−2}.

It remains unknown whether the bound (6) is achievable. It is shown in [21, Thm. 3] that there exists a reconstruction algorithm with error probability ε ∈ (0, 1/2) if the number of samples is greater than

    n ≥ ( βd(3e^{2βd} + 1) / sinh²(α/4) )² (16 log p + 4 ln(2/ε)).    (7)

Unfortunately, the existence proof presented in [21] is based on an exhaustive search with the intractable maximum likelihood estimator, and thus it does not guarantee actual existence of an algorithm with low computational complexity.
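To get a feel for the gap between the necessary bound (6) and the sufficient bound (7), both can be evaluated numerically. The sketch below is illustrative only: the parameter values are ours, and the logarithms in (6)-(7) are all taken natural here.

```python
import math

def lower_bound(p, d, alpha, beta):
    """Necessary number of samples, Eq. (6)."""
    t1 = math.exp(beta * d) * math.log(p * d / 4 - 1) / (4 * d * alpha * math.exp(alpha))
    t2 = math.log(p) / (2 * alpha * math.tanh(alpha))
    return max(t1, t2)

def upper_bound(p, d, alpha, beta, eps=0.05):
    """Sufficient number of samples for the (intractable) estimator of [21, Thm. 3], Eq. (7)."""
    c = beta * d * (3 * math.exp(2 * beta * d) + 1) / math.sinh(alpha / 4) ** 2
    return c ** 2 * (16 * math.log(p) + 4 * math.log(2 / eps))

# Illustrative parameters: a 4-regular graph on 64 spins with uniform couplings 0.7.
n_nec = lower_bound(p=64, d=4, alpha=0.7, beta=0.7)
n_suf = upper_bound(p=64, d=4, alpha=0.7, beta=0.7)
assert n_nec < n_suf  # the gap between (6) and (7) is many orders of magnitude
```

The gap illustrates why an explicit low-complexity estimator matching the exp(βd) scaling of (6) parametrically is the interesting target.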
Notice also that the number of samples in (7) scales as exp(4βd) when d and β are asymptotically large, and as α^{−4} when α is asymptotically small.

2.2 Regularized Interaction Screening Estimator

The main contribution of this paper consists in presenting explicitly a structure-learning algorithm that is of low complexity and which is near-optimal with respect to the bounds (6) and (7). Our algorithm reconstructs the structure of the Ising model exactly, as stated in Eq. (3), with an error probability ε ∈ (0, 1/2), and with a number of samples which is at most proportional to exp(6βd) and α^{−2}. (See Theorem 1 and Theorem 2 below for mathematically accurate statements.) Our algorithm consists of two steps. First, we estimate the couplings in the vicinity of every node. Then, in the second step, we threshold to zero the estimated couplings that are sufficiently small. A resulting zero coupling means that the corresponding edge is not present.

Denote the set of couplings around node u ∈ V by the vector θ*_u ∈ R^{p−1}. In this slightly abusive notation, we use the convention that a coupling equal to zero reads as absence of the edge, i.e. θ*_ui = 0 if and only if (u, i) ∉ E. Note that if the node degree is bounded by d, the vector of couplings θ*_u is non-zero in at most d entries.

Our estimator for the couplings around node u ∈ V is based on the following loss function, coined the Interaction Screening Objective (ISO):

    S_n(θ_u) = (1/n) Σ_{k=1}^{n} exp( − Σ_{i∈V∖u} θ_ui σ_u^(k) σ_i^(k) ).    (8)

The ISO is an empirical weighted average, and its gradient is the vector of weighted pair-correlations involving σ_u.
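Lemmas 1 and 2 in Section 3 show that at θ_u = θ*_u these weighted pair-correlations vanish exactly and have unit variance. On a model small enough to enumerate, both facts can be checked directly; the following sketch (a three-spin chain with illustrative couplings, our own naming) computes the exact expectation of the weighted correlation X_ul = −σ_u σ_l exp(−Σ_i θ*_ui σ_u σ_i):

```python
import itertools
import numpy as np

# Illustrative three-spin chain: edges (0,1) and (1,2), couplings 0.8.
p = 3
theta = np.zeros((p, p))
theta[0, 1] = theta[1, 0] = 0.8
theta[1, 2] = theta[2, 1] = 0.8

# Exact Ising distribution by enumeration of all 2^p configurations.
configs = np.array(list(itertools.product([-1, 1], repeat=p)), dtype=float)
energies = np.array([s @ np.triu(theta) @ s for s in configs])
weights = np.exp(energies)
mu = weights / weights.sum()

# Weighted pair-correlations around node u = 0.
u = 0
screen = np.exp(-(configs @ theta[u]) * configs[:, u])  # interaction-screening weight
for l in [1, 2]:
    X = -configs[:, u] * configs[:, l] * screen
    mean = mu @ X        # vanishes at the true couplings (Lemma 1)
    second = mu @ X**2   # equals one at the true couplings (Lemma 2)
    assert abs(mean) < 1e-10
    assert abs(second - 1.0) < 1e-10
```

The cancellation is exact, not asymptotic: the screening weight removes the factor of μ that couples σ_u to its neighborhood, so the σ_u = ±1 terms cancel pairwise in the expectation.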
At θ_u = θ*_u the exponential weight cancels exactly with the corresponding factor in the distribution (1). As a result, weighted pair-correlations involving σ_u vanish, as if σ_u were uncorrelated with any other spins, or completely "screened" from them, which explains our choice of name for the loss function. This remarkable "screening" feature of the ISO suggests the following choice of the Regularized Interaction Screening Estimator (RISE) for the interaction vector around node u:

    θ̂_u(λ) = argmin_{θ_u ∈ R^{p−1}}  S_n(θ_u) + λ‖θ_u‖₁,    (9)

where λ > 0 is a tunable parameter promoting sparsity through the additive ℓ1-penalty. Notice that the ISO is an empirical average of exponential functions of θ_u, which implies that it is convex. Moreover, the addition of the ℓ1-penalty preserves the convexity of the minimization objective in Eq. (9).

As expected, the performance of RISE does depend on the choice of the penalty parameter λ. If λ is too small, θ̂_u(λ) is too sensitive to statistical fluctuations; if λ is too large, θ̂_u(λ) has too much of a bias towards zero. In general, the optimal value of λ is hard to guess. Luckily, the following theorem provides strong guarantees on the square error for the case when λ is chosen to be sufficiently large.

Theorem 1 (Square Error of RISE). Let {σ^(k)}_{k=1,...,n} be n realizations of p spins drawn i.i.d. from an Ising model with maximum degree d and maximum coupling intensity β. Then for any node u ∈ V and for any ε₁ > 0, the square error of the Regularized Interaction Screening Estimator (9) with penalty parameter λ = 4√(ln(3p/ε₁)/n) is bounded, with probability at least 1 − ε₁, by

    ‖θ̂_u(λ) − θ*_u‖₂ ≤ 2⁸ √d (d + 1) e^{3βd} √( ln(3p/ε₁) / n ),    (10)

whenever n ≥ 2¹⁴ d² (d + 1)² e^{6βd} ln(3p²/ε₁).

Our structure estimator (for the second step of the algorithm), Structure-RISE, takes the RISE output and thresholds to zero the couplings whose absolute value is less than α/2:

    Ê(λ, α) = { (i, j) ∈ V × V  |  |θ̂_ij(λ) + θ̂_ji(λ)| ≥ α }.    (11)

The performance of Structure-RISE is fully quantified by the following theorem.

Theorem 2 (Structure Learning of Ising Models). Let {σ^(k)}_{k=1,...,n} be n realizations of p spins drawn i.i.d. from an Ising model with maximum degree d, maximum coupling intensity β and minimal coupling intensity α. Then for any ε₂ > 0, Structure-RISE with penalty parameter λ = 4√(ln(3p²/ε₂)/n) reconstructs the edge set perfectly with probability

    P[ Ê(λ, α) = E ] ≥ 1 − ε₂,    (12)

whenever n ≥ max(d/16, α^{−2}) · 2¹⁸ d (d + 1)² e^{6βd} ln(3p³/ε₂).

Proofs of Theorem 1 and Theorem 2 are given in Subsection 3.3.

Theorem 1 states that RISE recovers not only the structure but also the correct values of the couplings, up to an error based on the available samples.
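The two-step procedure can be sketched numerically with plain proximal-gradient steps, using soft-thresholding to handle the ℓ1-penalty. This is a minimal illustration, not the authors' implementation: the three-spin chain, sample size, step size and seed are all ours.

```python
import numpy as np

def iso(theta, s_u, s_rest):
    """Interaction Screening Objective S_n (Eq. 8) and its gradient."""
    w = np.exp(-s_u * (s_rest @ theta))  # screening weights, one per sample
    grad = -(w[:, None] * s_u[:, None] * s_rest).mean(axis=0)
    return w.mean(), grad

def rise(s_u, s_rest, lam, lr=0.05, iters=4000):
    """RISE (Eq. 9): minimize ISO + lam * ||theta||_1 by proximal gradient."""
    theta = np.zeros(s_rest.shape[1])
    for _ in range(iters):
        _, g = iso(theta, s_u, s_rest)
        theta = theta - lr * g
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)  # soft-threshold
    return theta

# Illustrative end-to-end run: 3-spin chain, edges (0,1) and (1,2), couplings 0.8.
rng = np.random.default_rng(1)
p, beta = 3, 0.8
theta_true = np.zeros((p, p))
theta_true[0, 1] = theta_true[1, 0] = beta
theta_true[1, 2] = theta_true[2, 1] = beta

# Exact i.i.d. sampling from the enumerated distribution (feasible for tiny p).
configs = np.array([[(k >> i & 1) * 2 - 1 for i in range(p)] for k in range(2 ** p)], dtype=float)
probs = np.exp(np.array([s @ np.triu(theta_true) @ s for s in configs]))
probs /= probs.sum()
n = 5000
samples = configs[rng.choice(2 ** p, size=n, p=probs)]

# Step 1: RISE around node u = 0, with the penalty of Theorem 2 (eps = 0.05).
lam = 4 * np.sqrt(np.log(3 * p ** 2 / 0.05) / n)
theta_hat = rise(samples[:, 0], samples[:, 1:], lam)

# Step 2: threshold at alpha / 2 (here alpha = beta).
edges_of_0 = [i + 1 for i, t in enumerate(theta_hat) if abs(t) >= beta / 2]
assert edges_of_0 == [1]  # neighborhood of node 0 is {1}
```

Since the ISO is convex and the ℓ1-penalty is separable, any proximal first-order scheme converges; the fixed step size here is chosen conservatively for this tiny instance.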
It is possible to improve the square-error bound (10) even further by first running Structure-RISE to recover the edges, and then re-running RISE with λ = 0 for the remaining non-zero couplings.

The computational complexity of RISE is equal to the complexity of minimizing the convex ISO and, as such, scales at most as O(np³). Therefore, the computational complexity of Structure-RISE scales at most as O(np⁴), simply because one has to call RISE at every node. We believe that this running-time estimate can be proven to be quasi-quadratic when using first-order minimization techniques, in the spirit of [23]. We have observed through numerical experiments that such techniques implement Structure-RISE with running time O(np²).

Notice that in order to implement RISE there is no need for prior knowledge of the graph parameters. This is a considerable advantage in practical applications, where the maximum degree or bounds on couplings are often unknown.

3 Analysis

The Regularized Interaction Screening Estimator (9) is from the class of so-called regularized M-estimators. Negahban et al. proposed in [22] a framework to analyze the square error of such estimators. As per [22], enforcing only two conditions on the loss function is sufficient to get a handle on the square error of an ℓ1-regularized M-estimator.

The first condition links the choice of the penalty parameter to the gradient of the objective function.

Condition 1.
The ℓ1-penalty parameter strongly enforces regularization if it dominates the partial derivatives of the objective function at θ_u = θ*_u, i.e.

    λ ≥ 2 ‖∇S_n(θ*_u)‖_∞.    (13)

Condition 1 guarantees that if the vector of couplings θ*_u has at most d non-zero entries, then the estimation difference θ̂_u(λ) − θ*_u lies within the set

    K := { Δ ∈ R^{p−1}  |  ‖Δ‖₁ ≤ 4√d ‖Δ‖₂ }.    (14)

The second condition ensures that the objective function is strongly convex on a restricted subset of R^{p−1}. Denote the remainder of the first-order Taylor expansion of the objective function by

    δS_n(Δ_u, θ*_u) := S_n(θ*_u + Δ_u) − S_n(θ*_u) − ⟨∇S_n(θ*_u), Δ_u⟩,    (15)

where Δ_u ∈ R^{p−1} is an arbitrary vector. Then the second condition reads as follows.

Condition 2. The objective function is restricted strongly convex with respect to K on a ball of radius R centered at θ_u = θ*_u if, for all Δ_u ∈ K such that ‖Δ_u‖₂ ≤ R, there exists a constant κ > 0 such that

    δS_n(Δ_u, θ*_u) ≥ κ ‖Δ_u‖₂².    (16)

Strong regularization and restricted strong convexity enable us to ensure that the minimizer θ̂_u of the full objective (9) lies in the vicinity of the sparse vector of parameters θ*_u. The precise formulation is given in the following proposition, which follows from [22, Thm. 1].

Proposition 1. If the ℓ1-regularized M-estimator of the form (9) satisfies Condition 1 and Condition 2 with R > 3√d λ/κ, then the square error is bounded by

    ‖θ̂_u − θ*_u‖₂ ≤ 3√d λ/κ.    (17)

3.1 Gradient Concentration

Like the ISO (8), its gradient in any component l ∈ V ∖ u is an empirical average,

    ∂S_n(θ_u)/∂θ_ul = (1/n) Σ_{k=1}^{n} X_ul^(k)(θ_u),    (18)

where the random variables X_ul^(k)(θ_u) are i.i.d. and are related to the spin configurations according to

    X_ul(θ_u) = −σ_u σ_l exp( − Σ_{i∈V∖u} θ_ui σ_u σ_i ).    (19)

In order to prove that the ISO gradient concentrates, we have to state a few properties of the support, the mean and the variance of the random variables (19), expressed in the following three lemmas. The first of the lemmas states that at θ_u = θ*_u, the random variable X_ul(θ*_u) has zero mean.

Lemma 1. For any Ising model with p spins and for all l ≠ u ∈ V,

    E[X_ul(θ*_u)] = 0.    (20)

As a direct corollary of Lemma 1, θ_u = θ*_u is always a minimum of the averaged ISO (8). The second lemma proves that at θ_u = θ*_u, the random variable X_ul(θ*_u) has a variance equal to one.

Lemma 2. For any Ising model with p spins and for all l ≠ u ∈ V,

    E[X_ul(θ*_u)²] = 1.    (21)

The next lemma states that at θ_u = θ*_u, the random variable X_ul(θ*_u) has bounded support.

Lemma 3. For any Ising model with p spins, with maximum degree d and maximum coupling intensity β, it is guaranteed that for all l ≠ u ∈ V,

    |X_ul(θ*_u)| ≤ exp(βd).    (22)

With Lemmas 1, 2 and 3, and using Bernstein's inequality, we are now in a position to prove that every partial derivative of the ISO concentrates uniformly around zero as the number of samples grows.

Lemma 4. Consider any Ising model with p spins, with maximum degree d and maximum coupling intensity β. For any ε₃ > 0, if the number of observations satisfies n ≥ exp(2βd) ln(2p/ε₃), then the following bound holds with probability at least 1 − ε₃:

    ‖∇S_n(θ*_u)‖_∞ ≤ 2 √( ln(2p/ε₃) / n ).    (23)

3.2 Restricted Strong Convexity

The remainder of the first-order Taylor expansion of the ISO, defined in Eq. (15), is explicitly computed as

    δS_n(Δ_u, θ*_u) = (1/n) Σ_{k=1}^{n} exp( − Σ_{i∈∂u} θ*_ui σ_u^(k) σ_i^(k) ) f( Σ_{i∈V∖u} Δ_ui σ_u^(k) σ_i^(k) ),    (24)

where f(z) := e^{−z} − 1 + z. In the following lemma we prove that Eq. (24) is controlled by a much simpler expression, using a lower bound on f(z).

Lemma 5. For all Δ_u ∈ R^{p−1}, the remainder of the first-order Taylor expansion admits the following lower bound:

    δS_n(Δ_u, θ*_u) ≥ ( e^{−βd} / (2 + ‖Δ_u‖₁) ) Δ_uᵀ Hⁿ Δ_u,    (25)

where Hⁿ is an empirical covariance matrix with elements Hⁿ_ij = (1/n) Σ_{k=1}^{n} σ_i^(k) σ_j^(k) for i, j ∈ V ∖ u.

Lemma 5 enables us to control the randomness in δS_n(Δ_u, θ*_u) through the simpler matrix Hⁿ, which is independent of Δ_u. This last point is crucial, as we show in the next lemma that Hⁿ concentrates towards its mean, again independently of Δ_u.

Lemma 6. Consider an Ising model with p spins, with maximum degree d and maximum coupling intensity β. Let δ > 0, ε₄ > 0 and n ≥ (2/δ²) ln(p²/ε₄). Then with probability greater than 1 − ε₄ we have, for all i, j ∈ V ∖ u,

    |Hⁿ_ij − H_ij| ≤ δ,    (26)

where the matrix H is the covariance matrix with elements H_ij = E[σ_i σ_j], for i, j ∈ V ∖ u.

The last ingredient that we need is a proof that the smallest eigenvalue of the covariance matrix H is bounded away from zero independently of the dimension p. Equivalently, the next lemma shows that the quadratic form associated with H is non-degenerate regardless of the value of p.

Lemma 7. Consider an Ising model with p spins, with maximum degree d and maximum coupling intensity β. For all Δ_u ∈ R^{p−1} the following bound holds:

    Δ_uᵀ H Δ_u ≥ ( e^{−2βd} / (d + 1) ) ‖Δ_u‖₂².    (27)

We stress that Lemma 7 is a deterministic result valid for all Δ_u ∈ R^{p−1}. We are now in a position to prove the restricted strong convexity of the ISO.

Lemma 8. Consider an Ising model with p spins, with maximum degree d and maximum coupling intensity β. For all ε₄ > 0 and R > 0, when n ≥ 2¹¹ d² (d + 1)² e^{4βd} ln(p²/ε₄), the ISO (8) satisfies, with probability at least 1 − ε₄, the restricted strong convexity condition

    δS_n(Δ_u, θ*_u) ≥ ( e^{−3βd} / (4(d + 1)(1 + 2√d R)) ) ‖Δ_u‖₂²,    (28)

for all Δ_u ∈ R^{p−1} such that ‖Δ_u‖₁ ≤ 4√d ‖Δ_u‖₂ and ‖Δ_u‖₂ ≤ R.

3.3 Proof of the Main Theorems

Proof of Theorem 1 (Square Error of RISE). We seek to apply Proposition 1 to the Regularized Interaction Screening Estimator (9). Using ε₃ = 2ε₁/3 in Lemma 4 and letting λ = 4√(ln(3p/ε₁)/n), it follows that Condition 1 is satisfied with probability greater than 1 − 2ε₁/3 whenever n ≥ e^{2βd} ln(3p/ε₁). Using ε₄ = ε₁/3 in Lemma 8, and observing that 12√d λ e^{3βd} (d + 1)(1 + 2√d R) < R for R = 2/√d and n ≥ 2¹⁴ d² (d + 1)² e^{6βd} ln(3p²/ε₁), we conclude that Condition 2 is satisfied with probability greater than 1 − ε₁/3. Theorem 1 then follows by using a union bound and then applying Proposition 1.

The proof of Theorem 2 becomes an immediate application of Theorem 1, by achieving an estimation of the couplings at each node with squared error α/2 and with probability 1 − ε₁ = 1 − ε₂/p.

4 Numerical Results

We test the performance of Structure-RISE, with the strength of the ℓ1-regularization parametrized by λ = 4√(ln(3p²/ε)/n), on Ising models over two-dimensional grids with periodic boundary conditions (thus the degree of every node in the graph is 4).
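For reference, the periodic-grid topology used in these experiments can be generated in a few lines (a sketch; the function and indexing convention are ours):

```python
def periodic_grid_edges(L):
    """Edge set of an L x L two-dimensional grid with periodic boundaries.

    Nodes are indexed 0 .. L*L-1 row by row; every node has degree 4 (for L >= 3).
    """
    edges = set()
    for r in range(L):
        for c in range(L):
            i = r * L + c
            edges.add(tuple(sorted((i, r * L + (c + 1) % L))))    # right neighbor (wraps)
            edges.add(tuple(sorted((i, ((r + 1) % L) * L + c))))  # down neighbor (wraps)
    return sorted(edges)

edges = periodic_grid_edges(4)  # the 4 x 4 grid used in the experiments below
assert len(edges) == 2 * 16     # |E| = 2p for a periodic grid
```

Each node contributes one "right" and one "down" edge, so the periodic grid on p = L² spins has exactly 2p edges.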
We have observed that this topology is one of the hardest for the reconstruction problem. We are interested in finding the minimal number of samples, n_min, such that the graph is perfectly reconstructed with probability 1 − ε ≥ 0.95. In our numerical experiments, we recover the value of n_min as the minimal n for which Structure-RISE outputs the perfect structure in 45 out of 45 trials with n samples, thus guaranteeing that the probability of perfect reconstruction is greater than 0.95 with a statistical confidence of at least 90%.

We first verify the logarithmic scaling of n_min with respect to the number of spins p. The couplings are chosen uniform and positive, θ*_ij = 0.7. This choice ensures that samples generated by Glauber dynamics are i.i.d. according to (1). Values of n_min for p ∈ {9, 16, 25, 36, 49, 64} are shown on the left in Figure 1. The empirical scaling, ≈ 1.1 × 10⁵ ln p, is orders of magnitude better than the rather conservative prediction of the theory for this model, 3.2 × 10¹⁵ ln p.

Figure 1: Left: linear-exponential plot showing the observed relation between n_min and p. The graph is a √p × √p two-dimensional grid with uniform and positive couplings θ* = 0.7. Right: linear-exponential plot showing the observed relation between n_min and β. The graph is the two-dimensional 4 × 4 grid; in red the couplings are uniform and positive, and in blue the couplings have uniform intensity but random sign.

We also test the exponential scaling of n_min with respect to the maximum coupling intensity β. The test is conducted in two different settings, both with p = 16 spins: the ferromagnetic case, where all couplings are uniform and positive, and the spin-glass case, where the sign of each coupling is assigned uniformly at random. In both cases the absolute value of the couplings, |θ*_ij|, is uniform and equal to β. To ensure that the samples are i.i.d., we sample directly from the exhaustive weighted list of the 2^16 possible spin configurations. The structure is recovered by thresholding the reconstructed couplings at the value α/2 = β/2.

Experimental values of n_min for different values of the maximum coupling intensity β are shown on the right in Fig. 1. The empirically observed exponential dependence on β is matched best by exp(12.8β) in the ferromagnetic case and by exp(5.6β) in the case of the spin glass; the theoretical bound for d = 4 predicts exp(24β). We observe that the difference in sample complexity depends significantly on the type of interaction. An interesting observation one can make based on these experiments is that the case which is harder from the sample-generating perspective is easier for learning, and vice versa.

5 Conclusions and Path Forward

In this paper we construct and analyze the Regularized Interaction Screening Estimator (9). We show that the estimator is computationally efficient and needs an optimal number of samples for learning Ising models. The RISE estimator does not require any prior knowledge about the model parameters for implementation, and it is based on the minimization of the loss function (8), which we call the Interaction Screening Objective. The ISO is an empirical average (over samples) of an objective designed to screen an individual spin/variable from its factor-graph neighbors.

Even though we focus in this paper solely on learning pairwise binary models, the "interaction screening" approach we introduce here is generic. The approach extends to learning other Graphical Models, including those over higher (discrete, continuous or mixed) alphabets and involving higher-order (beyond pairwise) interactions.
These generalizations are built around the same basic idea pioneered in this paper: the interaction screening objective (a) achieves its minimum over candidate GM parameters at the actual values of the parameters we aim to learn, and (b) is an empirical average over samples. In future work, we plan to further explore the theoretical and experimental power, characteristics and performance of the generalized screening estimator.
Acknowledgment: We are thankful to Guy Bresler and Andrea Montanari for valuable discussions, comments and insights. The work was supported by funding from the U.S. Department of Energy's Office of Electricity as part of the DOE Grid Modernization Initiative.

References

[1] D. Marbach, J. C. Costello, R. Kuffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, M. Kellis, J. J. Collins, and G. Stolovitzky, "Wisdom of crowds for robust gene network inference," Nat. Meth., vol. 9, pp. 796–804, Aug. 2012.

[2] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, "Direct-coupling analysis of residue coevolution captures native contacts across many protein families," Proceedings of the National Academy of Sciences, vol. 108, no. 49, pp. E1293–E1301, 2011.

[3] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, "Weak pairwise correlations imply strongly correlated network states in a neural population," Nature, vol. 440, pp. 1007–1012, Apr. 2006.

[4] S. Roth and M. J. Black, "Fields of experts: a framework for learning image priors," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 860–867, June 2005.

[5] N. Eagle, A. S. Pentland, and D. Lazer, "Inferring friendship network structure by using mobile phone data," Proceedings of the National Academy of Sciences, vol. 106, no. 36, pp.
15274–15278, 2009.

[6] M. He and J. Zhang, "A dependency graph approach for fault detection and localization towards secure smart grid," IEEE Transactions on Smart Grid, vol. 2, pp. 342–351, June 2011.

[7] D. Deka, S. Backhaus, and M. Chertkov, "Structure learning and statistical estimation in distribution networks," submitted to IEEE Control of Networks; arXiv:1501.04131; arXiv:1502.07820, 2015.

[8] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, pp. 462–467, May 1968.

[9] A. d'Aspremont, O. Banerjee, and L. E. Ghaoui, "First-order methods for sparse covariance selection," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 56–66, 2008.

[10] J. K. Johnson, D. Oyen, M. Chertkov, and P. Netrapalli, "Learning planar Ising models," Journal of Machine Learning, in press; arXiv:1502.00916, 2015.

[11] T. Tanaka, "Mean-field theory of Boltzmann machine learning," Phys. Rev. E, vol. 58, pp. 2302–2310, Aug. 1998.

[12] H. J. Kappen and F. d. B. Rodríguez, "Efficient learning in Boltzmann machines using linear response theory," Neural Computation, vol. 10, no. 5, pp. 1137–1156, 1998.

[13] Y. Roudi, J. Tyrcha, and J. Hertz, "Ising model for neural data: Model quality and approximate methods for extracting functional connectivity," Phys. Rev. E, vol. 79, p. 051915, May 2009.

[14] F. Ricci-Tersenghi, "The Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods," Journal of Statistical Mechanics: Theory and Experiment, vol. 2012, no. 08, p. P08015, 2012.

[15] G. Bresler, D. Gamarnik, and D. Shah, "Hardness of parameter estimation in graphical models," in Advances in Neural Information Processing Systems 27 (Z.
Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 1062–1070, Curran Associates, Inc., 2014.

[16] A. Montanari, "Computational implications of reducing data to sufficient statistics," Electron. J. Statist., vol. 9, no. 2, pp. 2370–2390, 2015.

[17] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, "High-dimensional Ising model selection using ℓ1-regularized logistic regression," Ann. Statist., vol. 38, pp. 1287–1319, 2010.

[18] A. Montanari and J. A. Pereira, "Which graphical models are difficult to learn?," in Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, eds.), pp. 1303–1311, Curran Associates, Inc., 2009.

[19] G. Bresler, E. Mossel, and A. Sly, "Reconstruction of Markov random fields from samples: Some observations and algorithms," SIAM Journal on Computing, vol. 42, no. 2, pp. 563–578, 2013.

[20] G. Bresler, "Efficiently learning Ising models on arbitrary graphs," in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 771–782, ACM, 2015.

[21] N. P. Santhanam and M. J. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions," IEEE Transactions on Information Theory, vol. 58, pp. 4117–4134, July 2012.

[22] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers," Statist. Sci., vol. 27, pp. 538–557, 2012.

[23] A. Agarwal, S. Negahban, and M. J. Wainwright, "Fast global convergence of gradient methods for high-dimensional statistical recovery," Ann. Statist., vol. 40, pp.
2452–2482, 2012.