__ Summary and Contributions__: This paper proposes a new model for multivariate data which aims at reducing the complexity of the Potts model. The core idea is that certain labels in multivariate data are special, and the pairwise relationship between variables depends only on whether each variable takes the special label or a non-special label, not on which particular non-special label it takes. The complexity of the pairwise term is thus reduced to that of an Ising model on binary data. The model still differs from an Ising model in that the unary potentials remain distinct for each of the multivariate labels, so the model's complexity and capacity are higher than simply merging all non-special labels into a single label. The authors also show that MLE in this setting reduces to evaluating multiple logistic regression functions, making the learning problem highly efficient, and they demonstrate empirical gains for the approach.
The main contribution of the paper is the proposal of a new model and the derivation of algorithms for learning it, both by block coordinate descent and by reduction to multiple logistic regression problems.
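To make the complexity reduction concrete: in a full Potts model each edge $(i,j)$ carries a $K \times K$ table of interaction parameters, whereas here (in my reconstruction from the summary above; the symbols $\lambda_{ij}$ and the special label $0$ are illustrative) each edge carries a single Ising-style parameter,

$$ \psi_{ij}(z_i, z_j) \;=\; \lambda_{ij}\,\mathbb{1}[z_i \neq 0]\,\mathbb{1}[z_j \neq 0], $$

so an edge costs one parameter instead of $O(K^2)$, while each node keeps its $K$ distinct unary potentials.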

__ Strengths__: The work captures an interesting insight about real-world problems: certain labels are special, and the complexity of pairwise multivariate models can be greatly reduced by exploiting this property. The authors propose a new model, the Potts-Ising model, based on this property. The examples and use cases for the newly proposed model seem convincing and useful for real-world applicability in many domains.
The model motivation, description, and learning details are specified at an appropriate level of detail, making the paper easy to read up to that section.
I particularly like that the work is motivated by cancer data and that the problem comes from a very real-world context. The experiments on a real-world dataset are a great plus, something missing in many current works in the graphical models community.

__ Weaknesses__: I believe the experiments are rather weak: I would have liked to see some inference results rather than only MMD measurements on these datasets. I am not sure about the experimental methodology of comparing at the distribution level without going to end inference results.
I am also not clear on some experimental details. I assume the pairwise connections in the movie ratings data are considered between variables that share the same movie and variables that share the same user. Is that correct, or are connections considered between all variables? Also, how is d=50 chosen? The same questions hold for the book ratings dataset.
A great utility of these models occurs when the number of variables is increased and there is repetitive structure. I understand the limitations of the real-world toxicity dataset, but no experiments beyond dimension 50 are provided even for MovieLens and the other datasets. It would be great to have some results and discussion on how these models scale with increasing dimension.
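For reference, the distribution-level MMD comparison discussed above can be computed with a generic biased RBF-kernel estimator. This is my own minimal sketch, not the authors' implementation; the function name and the `gamma` bandwidth are illustrative choices:

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased squared-MMD estimate between sample sets X and Y,
    using the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        # Pairwise squared Euclidean distances via the expansion trick.
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Identical sample sets give an estimate of zero, and the estimate grows as the two distributions separate.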
Post Rebuttal:
Thanks for the detailed response to all the reviewers' questions. I appreciate your effort on the additional experiments and the detailed review. Overall, I liked that the approach is simple and motivated by a real-world context. I am happy to increase my score based on the response and explanation, but would definitely appreciate more experiments with a larger number of variables.

__ Correctness__: The methods and claims look correct. I am not sure if this is the right empirical methodology to evaluate such results. I would have liked to see some inference results from the learned models.

__ Clarity__: The paper is well motivated, but the mathematical notation is confusing in Section 3. I would strongly suggest the authors avoid the three subscript indices in $z_{itk}$. While one can figure out the notation by spending time, I believe it can be greatly simplified. I suggest the authors refer to the sections on the Potts and Ising models in [Koller and Friedman 2009] to simplify the notation and increase accessibility for the reader.
The experiments section has some missing details, as specified above.

__ Relation to Prior Work__: Yes, I believe the previous work has been discussed in decent detail with respect to the problem.

__ Reproducibility__: Yes

__ Additional Feedback__: I still believe some experimental details are missing; some of them are specified above.
Also, the paper says “The sampling parameter is generally taken to be m=1000 in our simulations”. In the case of the toxicity data, how can m be 1000 when the total n is on the order of 300? Am I misunderstanding something here? These experimental details should be clearly specified.

__ Summary and Contributions__: This paper proposes a new model for sparse discrete data by modeling the independent parameters as categorical while simplifying the pairwise parameters to binary variables. The paper shows that both the categorical parameters and the pairwise parameters can be estimated via nodewise logistic regressions or alternating minimization. Finally, the paper presents empirical results on several datasets and compares to previous methods for count data.

__ Strengths__: - Develops two algorithms for optimizing the proposed model.
- Compares to multiple previous methods for discrete data (particularly count-based data). Many good baselines included.

__ Weaknesses__: I appreciated the author response with the new experiments and new evaluation method. The new evaluation idea is very nice and does show a more complete picture. Also, I greatly appreciated the comparison to Copula Multinomial. In terms of computation time, CopMult is much faster (almost trivial), so this should be added to the discussion section, but it seems that POIS-g does in fact do better in some circumstances. Also, the interpretability results could still use improvement and validation. I've updated my score based on this.
------ Original review
- From the evaluations, the proposed method cannot consistently beat the Ind Mult baseline in terms of MMD: the MMD for Independent Multinomial is just as good as for the POIS approaches.
- Lacks comparison to an important simple baseline: Copula Multinomial (i.e., use a multinomial for the marginals and a Gaussian copula for the dependence, via the simple IFM method, similar to CopPoi but with multinomial marginals). This would likely be a very difficult baseline to beat, as it would perform strictly better than Ind Multinomial. You could also use graphical lasso or the nonparanormal SKEPTIC to estimate the Gaussian copula [1] in the high-dimensional regime---and these are known to have very good theoretical rates. Additionally, this approach could also provide a copula correlation matrix or sparse inverse correlation matrix (similar to the $\Lambda$ matrix of POIS). Finally, this approach would be very fast, probably no more than the Copula Poisson + Ind Mult computation times combined.
- Unclear if learned correlations/dependencies are intuitive/good or just spurious. More quantitative or qualitative justification via interpretation would improve the paper. See correctness below for more details.
[1] Liu, Han; Han, Fang; Yuan, Ming; Lafferty, John; Wasserman, Larry. High-dimensional semiparametric Gaussian copula graphical models. Ann. Statist. 40 (2012), no. 4, 2293--2326.
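To illustrate how cheap the suggested CopMult baseline would be, here is my own rough IFM-style sketch (the function name and the mid-CDF transform for discrete margins are my choices, not from the paper): fit empirical multinomial marginals per column, map each observation to a normal score, then estimate the copula correlation matrix.

```python
import numpy as np
from scipy.stats import norm

def fit_copula_multinomial(Z):
    """IFM-style fit: empirical multinomial marginals + Gaussian copula.

    Z is an (n, d) integer matrix. Each discrete value z is mapped to the
    mid-CDF point (F(z-) + F(z)) / 2, then through the normal quantile,
    before estimating the copula correlation matrix.
    Returns (list of per-column pmfs, d x d correlation matrix)."""
    n, d = Z.shape
    scores = np.empty_like(Z, dtype=float)
    marginals = []
    for j in range(d):
        vals, counts = np.unique(Z[:, j], return_counts=True)
        pmf = dict(zip(vals, counts / n))
        marginals.append(pmf)
        cdf = dict(zip(vals, np.cumsum(counts / n)))
        lower = {v: cdf[v] - pmf[v] for v in vals}   # left limit F(z-)
        u = np.array([(lower[z] + cdf[z]) / 2 for z in Z[:, j]])
        scores[:, j] = norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))
    return marginals, np.corrcoef(scores, rowvar=False)
```

The resulting correlation matrix (or its sparse inverse, estimated by graphical lasso on the normal scores) would be the object to compare against the $\Lambda$ matrix of POIS.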

__ Correctness__: A more thorough understanding of the correlations is still needed.
------ Original review
The evaluation method via MMD seems reasonable. The derivations seem correct.
The paper claims that "POIS uncovers interesting correlation structures in the data". This is not substantiated in any qualitative or quantitative way, except by showing the dependency matrices and saying that they have richer structure than the Spearman correlation; they are not interpreted or explained further. These matrices may merely be spurious dependencies. Also, the correct comparison would probably be to the *inverse* of the Spearman correlation matrix (which would correspond to the graphical model structure of a Gaussian copula model). Currently, the comparison is probably unreasonable (it's like comparing a matrix to its inverse). I would also want to see how this compares to the graphical model structure of the Multinomial Copula model above.

__ Clarity__: The technical part is very difficult to read especially with the many non-standard notations that are introduced. It is almost impossible to keep track of all the non-standard symbols and notation on a first read of the paper. The paper could be improved by simplifying notation. For example, it may be simpler to actually just use a summation instead of the circle plus symbol.

__ Relation to Prior Work__: The relation to prior work is reasonable. However, the paper compares primarily to Poisson-based prior models. It is somewhat unsurprising that Poisson models do not work for these high sparsity datasets. Also, the paper uses review data that is on a scale of 5, which is bounded and thus inherently not like Poisson or negative binomial models. Even a binomial distribution with N=5 may be closer to the true distribution.

__ Reproducibility__: Yes

__ Additional Feedback__: Maybe use $\tilde{z}$ instead of $\sigma$ to help the reader remember that it is related to $z$.
I'd suggest writing the idea of solving using logistic regression as a lemma or proposition. Then, describe the intuitive steps of the proof but put the actual proof in the appendix. The complex details of the proof make the paper hard to read and the derivation itself does not seem particularly insightful.
Why is coordinate descent better than conditional log-likelihood? Is this just because you are jointly optimizing all the parameters instead of splitting them?
An idea for model simplification:
Could you write your model more simply using indicator functions? For example, you could write I(z_i = j) for the independent sufficient statistics and I(z_i \neq 0) I(z_j \neq 0) as the pairwise sufficient statistics.
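Written out, the suggested reformulation would read (my notation; $\theta$ and $\lambda$ are illustrative, with $0$ the special label):

$$ p(z) \;\propto\; \exp\Big(\sum_i \sum_{j \neq 0} \theta_{ij}\, I(z_i = j) \;+\; \sum_{i < i'} \lambda_{ii'}\, I(z_i \neq 0)\, I(z_{i'} \neq 0)\Big), $$

which makes both sets of sufficient statistics explicit in one line.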
Typo?:
Line 204: Should the reference be [12] rather than [19]?

__ Summary and Contributions__: The authors discuss a special case of the Potts model that can be fit using the neighborhood regression framework in much the same way as Ising models (and with similar complexity).
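For context, the generic neighborhood (nodewise) regression recipe referred to above can be sketched in a few lines. This is my own illustration of the standard Ising-style procedure, not the paper's code; all names are mine, and the plain gradient-descent logistic fit is only for self-containedness:

```python
import numpy as np

def logistic_fit(X, y, lr=0.5, iters=3000):
    """Plain gradient-descent logistic regression; returns weights
    with the intercept in position 0."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def nodewise_ising_fit(B):
    """Neighborhood regression for a binary (n, d) data matrix B:
    regress each column on all the others; the slope coefficients
    play the role of that node's pairwise (edge) parameters."""
    n, d = B.shape
    W = np.zeros((d, d))
    for i in range(d):
        others = np.delete(np.arange(d), i)
        w = logistic_fit(B[:, others], B[:, i])
        W[i, others] = w[1:]  # drop the intercept
    return W
```

The point of the paper, as I understand it, is that the restricted Potts model admits essentially this same per-node logistic-regression decomposition despite having categorical unary terms.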

__ Strengths__: - This particular restriction is novel to me and the argument for why this model might make sense in practice is reasonable.

__ Weaknesses__: - The restriction seems quite severe. One wonders if there are even slightly more complicated models that could improve over this result. In particular, the observation that this can be fit almost exactly like an Ising model isn't really surprising.
- The analysis looks at a single data set. More detailed experiments are needed to draw the kinds of conclusions that are suggested by this work. This is particularly important for me as the motivation is that lots of applications have the feature necessary for this model to make sense.
- I like the motivating problem, but it seems that you might want some correlations between the different classes (even if mild) on any real data.
- The technical contributions quite closely follow existing work which limits their novelty.
---Post Rebuttal---
I'm still a bit skeptical that such a coarse approach would really work on a broad range of data sets - though there is enough variety to suggest that perhaps it does. The comment in the rebuttal, "It may not be as severe as it may seem," does little though to help me understand the counterintuitive (at least in my mind) nature of the experimental results. Does changing which level is designated as '0' impact the quality of the results?

__ Correctness__: As far as I can tell.

__ Clarity__: The paper is clear and easy to follow except for a few minor typos here and there.

__ Relation to Prior Work__: There are lots of different special cases of the Potts model considered in the physics literature and beyond. It might be worth citing a few of these just for context.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper presents a new model, the Potts-Ising Model (POIS), to describe survey, rating, and sparse count data. The authors also provide a corresponding algorithm to learn the model from data. The work is strongly motivated by a specific example from cancer drug clinical trials.

__ Strengths__: - I find it very interesting and refreshing that the motivating example---toxicity data in cancer clinical trials---is discussed right away. This provides very good motivation for the work and a clear setup for broader impacts.
- The model is simple to understand but flexible, and fills a gap between simple Ising models and much more general Potts models.
- Testing on 4 publicly available data sets and a new cancer drug toxicity data set is strong.

__ Weaknesses__: - It would help to have some more justification of the statement, "much of the statistical flexibility of the Potts model is also retained" (line 98).
- Table 1 is hard to read. Authors should use bold text or something else to distinguish best performers in each column.
- A specific example of the "richer correlation structure" found by POIS (line 246) would be more convincing.

__ Correctness__: Yes.

__ Clarity__: Yes, very well written.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: "The POIS" sounds a bit funny as an acronym for the Potts-Ising Model. Why not "PIM"?
UPDATE: This paper stands out to me. It is well written and strongly motivated.