NeurIPS 2020

### Review 1

Summary and Contributions: The paper presents sample complexity results for estimating the parameters of Ising models under Huber's contamination model. In particular, it is assumed that samples are drawn from a distribution: (1-eps) P_theta + eps Q where P_theta is an Ising model without external fields and Q is an arbitrary distribution over {-1, 1}^p. As usual, the Ising model density P_theta(x) over the hypercube is proportional to exp( x^T theta x). The Ising model P_theta to be learned is assumed to be in high temperature, satisfying Dobrushin's condition. The goal is to estimate, to within some small ell_2 error, the vector of theta_ij parameters at each node i of the Ising model, i.e. for every node i, estimate theta_i=(theta_ij)_j The paper provides methods to do so whose error scales as robustness error + sampling error The first term is eps \sqrt{log(1/eps)} while the second takes the form - sqrt{ k log p /n}, if the Ising model has k edges or - sqrt{d log p/n}, when every node has degree d So the paper provides sparse regression type bounds for the theta_i's from a contaminated distribution. Unfortunately, none of these methods are computationally efficient. In terms of techniques the most interesting contribution in my view is probably Theorem 1 (in Section 2/appendix B) which translates from a TV bound between two Ising models to an ell_2 bound for their corresponding theta_i vectors, under Dobrushin's condition. The translation from a TV bound to a parameter bound analyzes the conditional log likelihood of the sample at a node conditioning on the other nodes. That function is concave and is maximized at the true parameters, and the argument proceeds by analyzing its strong concavity and bounding its gradient. The proof uses recent results proven about the Ising model under Dobrushin's condition, and this approach of comparing the gradient and the Hessian has also been done in recent papers on Ising models. In any event, it is a good bound to know. Another interesting insight (in Section 4/appendix D) is combining the afore-mentioned recent results about the Ising model, the strong concavity of the conditional log-likelihood and robust mean estimation in order to do robust l1 regression at every node.

Strengths: Learning under sample contamination is an interesting problem for ML and Statistics. Ising models are a prominent family of distributions and Huber's contamination model is a standard model. The paper provides sparse regression error bounds under Huber's model which are welcome. It also provides a structural result translating TV error to parameter error, and an approach to robust sparse l1 regression around every node.

Weaknesses: The algorithms are inefficient, and the techniques are not very novel in my view: -Analyzing the log likelihood around the optimum by bounding its strong convexity and its gradient isn't too novel an approach - Using the Yatracos class to learn Ising models has been employed in a recent paper: https://arxiv.org/abs/1806.06887 - In fact, all the technical staff in Section 3 can be circumvented as follows: 1. It is easy to construct an epsilon net in TV distance of all Ising models with at most k edges of size at most (p^2 choose k) (k/eps^2)^k (the way to do this is to use the exact expression for Symmetric KL distance between two Ising models, Pinsker's inequality, and gridding of all edge parameters) 2. Whenever a eps-net of size N exists for a family of distribution, it is known that logN / eps^2 samples from a target distribution in the family suffice to select a distribution from the net that is eps close to the target the error rate of sqrt{k log(p)/n} for learning in TV distance follow from the above observations.. In any event, obtaining the TV to parameter bound and doing robust sparse regression at each node hasn't been done before as far as I know, and it employs some heavy machinery recently shown for the Ising model.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

### Review 2

Summary and Contributions: The paper proposed a method for learning an Ising model from data samples from a contamined model (a mixture of an Ising model and an arbitrary distribution).

Strengths: The contamination model of data samples is well-known and reasonable (Huber).

Weaknesses: The condition \max_{u \in V} \sum_{v \in V} |\theta_{uv}| < 1 from Line 78 and 89 seems too restrictive. The proposed estimators do not recover the true model as n goes to infinity. The proposed estimators are not computationally efficient. (Please see my additional feedback below.)

Correctness: The main claims seem correct.

Clarity: The paper is clear.

Relation to Prior Work: Relation to prior work is properly discussed.

Reproducibility: Yes

Additional Feedback: Please discuss the condition \max_{u \in V} \sum_{v \in V} |\theta_{uv}| < 1 from Line 78 and 89. For a growing number of nodes in V, this condition seems very restrictive because of the summation across all nodes. The proposed estimators do not recover the current model as n goes to infinity. In Lemma 2 and Corollary 2, there are terms 2\epsilon and 2\epsilon\sqrt{log(1/\epsilon)} respectively, which are constant with respect to n. Similarly, more complicated terms are in Theorem 2 and Corollary 3. Is this related to Lemma 1? Please clarify. Estimators in Eq.(8) and Eq.(10) are not computationally efficient. Can the authors provide more insight into #P-hardness? I can see that in both cases the \argmin and the \sup are hard to efficiently compute. === AFTER REBUTTAL I am satisfied with the response regarding the Dobrushin condition (\max_{u \in V} \sum_{v \in V} |\theta_{uv}| < 1). I am also satisfied with the response to the estimator inconsistency, specifically the clarification regarding Lemma 1. The authors state that their estimations are "exponential time" (i.e., worse than #P-hard). Still I believe the results can be interesting for the community. Given the above, I raise my evaluation from 6 to 7.

### Review 3

Summary and Contributions: The paper studies the problem of robustly learning discrete graphical models where a small fraction of the samples can be arbitrarily corrupted (known as Huber’s contamination model). It focuses on Ising models where prior work provided partial results. The authors operate (for most of the paper) under the Dobrushin’s uniqueness regime where they are able to obtain a bound on the l2 distance between the underlying parameters vectors of two Ising models which are epsilon-close in Total Variation (TV) distance. They use this together with the observation that under epsilon Huber’s contamination, the TV between corrupted and true distribution is at most epsilon; to show that in the asymptotic setting a minimum distance estimator can recover the parameter vector and edge structure (when minimum edge strength is large enough) for graphs on p nodes with (i) bounded (k) number of edges, (ii) bounded degree (d). Then they move to showing how an approximation of the minimum distance estimator (inspired by work of Yatracos) works in the non-asymptotic setting for graphs with bounded number of edges. For graphs with bounded degree in the non-asymptotic setting: they propose an estimator which performs a robust sparse logistic regression to recover the parameters of the uncontaminated model.

Strengths: The paper studies a problem of high relevance to NeurIPS community and provides a tight understanding of the information theoretic complexity of the problem for specific graph families. The claims are theoretically sound. As a by-product the paper also provides a modulus of continuity bound for Ising models under Dobrushin’s condition which is the first result of its kind for this regime. This is a non-trivial and novel statement which has other potential implications beyond the problem focused on in this work.

Weaknesses: A big part of the challenge of robust estimation is computationally efficient algorithms and unfortunately the paper does not address the computational complexity of the task. The proposed algorithms in the paper seem to require an exponential amount of time to perform. Given the potential computational inefficiency of the algorithms, the improvement over previously known bounds in the non-contaminated case is less impressive as the previously known bounds cited in the paper are shown for computationally efficient algorithms.

Correctness: I have not verified all the claims but they appear correct.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Questions/Suggestions: 1. Could you comment on the computational complexity of your estimators for G_p,d and G_p,k in the non-asymptotic setting? 2. In the non-contaminated setting, it is known that model width works as a more relaxed parameter of complexity of learning as opposed to maximum degree. Can something similar be said here? 3. It seems like one of the significant contributions of the paper is the modulus of continuity bound under Dobrushin’s condition. Where else within the Ising model literature might this bound have applications? Minor Typos: 1. Line 208: “leadins” -> “leads” ------ Post-Author Feedback------ Thank you to the authors for providing answers to some of my questions. I have carefully considered the response and I do feel the lack of computational efficiency is still a weakness although a statistical understanding of what is possible is an important first step. Given that, the modulus of continuity bound they provide has other applications which they expand upon in their response. This is an interesting result the authors show which makes a case for the acceptance of the paper. Overall, I am inclined to maintain my score for the paper.

### Review 4

Summary and Contributions: This paper considers the task of information-theoretically learning Ising models with an unknown graph structure in Huber’s eps-contamination model. The paper characterizes the TV modulus of continuity for learning Ising models in Theorem 1 and Lemma 1, which yield information theoretic lower bounds on the possible rates of estimation in terms of eps. It then provides and analyzes two different inefficient algorithms for estimation that achieve the optimal estimation rate in eps: (1) a TV projection estimator and (2) an algorithm estimating the neighborhood of each node with robust sparse logistic regression. These estimators are analyzed in the two different parameter regimes where the total number of edges of the underlying graph is bounded by a parameter k and when the max degree is bounded by d. As discussed in the paper, the estimator also recovers an improved information-theoretic upper bound in latter case even when eps = 0, in particular obtaining a better dependence on d. UPDATE: Thank you to the authors for the response and changes. While Dobrushin's condition is popular, it's unclear that it's necessary in this information-theoretic rate a priori (maybe especially given that the Lindgren result does not need this?). However, it is satisfying that the paper has some partial results towards removing Dobrushin's condition in the appendix. My review remains that this paper should be accepted.

Strengths: The paper involves a number of interesting techniques and proofs. The proof of Theorem 1 seems nontrivial and is a large part of the supplementary material. As mentioned above, the information-theoretic upper bounds in this paper seem to improve upon the best known upper bounds even in the uncontaminated case. It seems as though the only known upper bounds in the contaminated case are the computationally efficient upper bounds in Lindgren et al. that achieve an error of sqrt(eps). So, this paper improves upon the best known information theoretic upper bounds substantially.

Weaknesses: The paper restricts its attention to the high-temperature regime. A priori, recent results for learning Ising models do not seem to need this kind of assumption. In contrast, the paper does not seem to justify that assuming Dobrushin’s conditions and being in the high-temperature regime is necessary. In many robust problems, Huber eps-contamination often yields an information-theoretic lower bound of \Omega(eps) on the rate of estimation of the model parameters. For example, this is the case in robust mean estimation, robust linear regression, robust PCA and other similar problems. The fact that this is also the case for Ising models is not surprising. Proofs of these facts in other problems have also established TV moduli of continuity. However, the proof of Theorem 1 here seems to be technically more challenging than analogous proofs for other problems in the literature.

Correctness: The proofs appear to be correct and are well-explained. While this is a purely theoretical paper, it does have synthetic experiments in Section 5 which seem correct and help the reader understand the paper’s theoretical results.

Clarity: The paper is very well written and organized. This is a strength of the paper.

Relation to Prior Work: The relation of the paper to prior work is well-explained in Section 1.1. A more detailed comparison of the upper and lower bounds in Lindgren et al. to those in this work would be helpful. It may also be helpful to discuss in more detail known information-theoretic lower bounds on the sample complexity in terms of k, d, lambda and omega, even if only in the uncontaminated setting for the task of learning uncorrupted Ising models.

Reproducibility: Yes

Additional Feedback: Pg. 3 Line 105, the TV equality should be an inequality