
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper introduces the class of Poisson DAG models, directed graphical models where each node is conditionally Poisson given its parents, e.g. through a GLM. This particular subclass of models may be of use in applications where count data is frequent, e.g. bioinformatics (although the authors don't really do a good job of motivating the approach). The authors prove, using an intuitive notion of overdispersion, that the causal ordering in these models is identifiable, and then provide an algorithm with theoretical guarantees that shows impressive performance in the simulation results (although I wonder how tight the bound in Thm 4.2 is; it looks to me like in many cases it just says that a probability is smaller than 1). My reservations on the paper are the following:
1 - It is not an easy paper to read: motivation is minimal and discussion is absent. The notation around eqn (4) beat me; I could not find the definition of C_{jk}, and in any case I don't understand what x \in X_{C_{jk}} means when x should be an integer, as it is a value taken by a Poisson variable. Also, I appreciate that the authors focus on the novel step 2 of the algorithm, but a minimal discussion of the other steps would not go amiss.
2 - The evaluation is impressive, but it relies heavily on a model-correctness assumption. I would have liked to see how the algorithm behaves under model mismatch (e.g. you tell it maximum indegree 2 but the data actually comes from an indegree-3 network).
3 - The paper clearly has its strengths in the theoretical/algorithmic part; however, an example with at least some real-world motivation would be good.
Having said these things, I think the idea is novel and interesting, and this could be a quality contribution.
Q2: Please summarize your review in 1-2 sentences
A paper presenting interesting novel theoretical results plus an algorithm for learning DAG models with conditionally Poisson variables. I found it worthwhile even though there could be improvements.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends graphical models determined by acyclic digraphs (DAGs) to count data by using Poisson models. The basic assumption, that variables that are conditionally Poisson are overdispersed relative to variables that are marginally Poisson, is not connected to the notion of conditional (in)dependence, which is IMO a basic feature to establish in any new formulation of a graphical model. In the numerical experiments, the authors compare their approach to other methods that, to my understanding, are not designed to work with count data. So it is necessary that the authors give more detail about how those methods were run on those data; for instance, were the count data log-transformed?
Q2: Please summarize your review in 1-2 sentences
This paper presents an algorithm to learn DAG models from count data using Poisson distributions. The key methodological contribution is to assume that variables that are conditionally Poisson are overdispersed relative to variables that are marginally Poisson. While the problem and the approach based on Poisson models is appealing, the paper IMO seems unfinished.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper considers inference of the structure of the DAG underlying a Poisson directed graphical model. The proposed strategy relies on the fact that nodes with parents are conditionally Poisson, whereas nodes without parents are marginally Poisson. The paper aims at distinguishing the former from the latter based on an overdispersion criterion.
I do like the principle of the approach but the construction is only partly convincing. My main concerns are the following:
1 - I am not completely clear about the proof of identifiability. The proof of Thm 3.1 starts with "assume that the true causal ordering \pi^* is {1, 2, ..., p}". To me, in a general directed graphical model, there may be several true causal orderings. As the proof relies on induction, this should be clarified.
2 - The exploration of the DAG space is indeed a critical task, and the authors simply suggest using the GLM-lasso at each step to reduce the search space. The objective function of the GLM-lasso is not necessarily consistent with the overdispersion criterion, so the global procedure is a bit heuristic.
3 - In the simulation study, I did not understand why all regression coefficients $\theta_{jk}$ are chosen to be negative (between -1 and -.7). This is all the more surprising as the authors state that Yang requires this negativity constraint, while they do not. Does this choice make the problem easier?
4 - As for the performances, they are indeed good with large $n$, even for fairly large $p$, but I was very surprised by the poor performance for $p = 10$ and $n = 500$, which seems to be a rather comfortable case. Do the authors have an idea why the method fails in such an apparently easy situation?
Minor remarks:
1 - In the proof of Thm 3.1, should "Pa(j)/{1, 2, ..., m}" be "Pa(j)\{1, 2, ..., m}"?
2 - The threshold $c_0$ is likely to have some influence on the results, which is not explored in the simulation study.
Q2: Please summarize your review in 1-2 sentences
The paper phrases an interesting problem in a clever and potentially efficient way. However, the proposed algorithm is partly ad hoc and the simulation study is not fully convincing.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their efforts and comments on Paper 437. We are grateful that the reviewers found the paper original and interesting, and that they noted it tackles an important problem. There were a few questions raised by the reviewers, which we clarify below in detail.
Rev #2 raises a question about the notation in eq. 4. We thank Rev #2 for raising this issue. We agree that the notation C_{jk} should be much clearer, and we will address this in the camera-ready version if the paper is accepted. To clarify, the goal of our algorithm is to find the causal ordering and then the set of parents for each node. C_{jk} defines a set of candidate parents which is a superset of the actual parents. The subscript j runs from 1 to p and denotes the index of the element of the causal ordering we are searching for; k (which depends on j) denotes a node index in the set of candidate nodes for the j-th element of the causal ordering. Hence, C_{jk} is the set of candidate parents corresponding to the k-th node index for the j-th element of the causal ordering.
Rev #2 asks how the algorithm behaves under different settings and under model mismatch. We have considered many different settings, such as different link functions, different ranges of the parameters \theta, and different indegrees. In short, our algorithm works well as long as Assumption 4.1 is satisfied. We will discuss and address model mismatch in the camera-ready version if the paper is accepted.
Rev #3 comments that our comparison methods in the simulations are not appropriate for Poisson count data. We thank Reviewer 3 for his/her comments and feedback. We agree that more details are required to explain how the comparison algorithms are adapted to count data. In short, score-based methods are adapted by using the Poisson log-likelihood as the score. Constraint-based methods such as the PC algorithm are adapted using a conditional-independence test based on mutual information. We thank the reviewer for pointing this out and will include these details in the camera-ready version of the paper if it is accepted.
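To make the adaptation of constraint-based methods concrete, a marginal-independence test based on empirical mutual information with a permutation null could be sketched as below. This is an illustrative reconstruction, not the paper's implementation; the function names, the toy data, and all parameters are our own, and a full PC adaptation would additionally condition on candidate separating sets.

```python
import numpy as np

def empirical_mi(x, y):
    """Empirical mutual information (in nats) between two discrete samples."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)          # joint count table
    p = joint / joint.sum()                   # joint pmf
    px = p.sum(axis=1, keepdims=True)         # marginal pmf of x
    py = p.sum(axis=0, keepdims=True)         # marginal pmf of y
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def mi_independence_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: x and y are independent."""
    rng = np.random.default_rng(seed)
    observed = empirical_mi(x, y)
    null = [empirical_mi(x, rng.permutation(y)) for _ in range(n_perm)]
    return float(np.mean([s >= observed for s in null]))

# Toy count data (illustrative parameters, not from the paper).
rng = np.random.default_rng(1)
x1 = rng.poisson(2.0, size=2000)
x2 = rng.poisson(np.exp(0.5 - 0.3 * x1))      # x2 depends on x1
x3 = rng.poisson(2.0, size=2000)              # independent of x1
print(mi_independence_test(x1, x2))           # dependent pair
print(mi_independence_test(x1, x3))           # independent pair
```

In a PC-style algorithm, an edge between two nodes would be removed whenever such a test fails to reject independence.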
Rev #3 also comments that we do not link our DAG model to conditional independence. We thank Reviewer 3 for pointing this out. We do not comment on it explicitly, but it is implied by the factorization of the joint distribution (see e.g. Lauritzen '96). We will clarify this point in the camera-ready version of the paper if it is accepted.
Rev #4 asks about the clarity of the proof of the identifiability condition. We are grateful for the insightful questions raised by Rev #4. As Rev #4 mentions, the true causal ordering may not be unique. However, the mathematical induction is still valid for any of the valid causal orderings. For any true causal ordering, the first component satisfies the property that its expectation equals its variance, while every other node is overdispersed. Given the chosen first component of the causal ordering, the second element of the true causal ordering satisfies the property that its conditional expectation equals its conditional variance, while the remaining nodes are conditionally overdispersed. We will make this more explicit in the camera-ready version if the paper is accepted.
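The mean-equals-variance property underlying this induction can be illustrated numerically. The sketch below simulates a hypothetical two-node Poisson DAG X1 -> X2 with made-up parameters (this is not the paper's model or algorithm, just an illustration of the overdispersion criterion): the root node has a variance-to-mean ratio near 1, while the child, being a Poisson mixture over its parent's values, is overdispersed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.poisson(2.0, size=n)                 # root: marginally Poisson
x2 = rng.poisson(np.exp(0.5 - 0.3 * x1))      # child: Poisson given its parent

def dispersion(x):
    """Variance-to-mean ratio: ~1 for a Poisson sample, >1 if overdispersed."""
    return x.var() / x.mean()

print(dispersion(x1))   # close to 1, so x1 can come first in the ordering
print(dispersion(x2))   # above 1, so x2 must have a parent
```

Repeating this with conditional dispersions, given already-selected nodes, is the idea behind recovering the ordering element by element.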
Rev #4 notes that the objective function of the GLM-lasso is not necessarily consistent with the overdispersion criterion, so that the global procedure is a bit heuristic. We agree that the GLM-lasso step is somewhat heuristic, since the GLM-lasso is not necessarily consistent. Due to space constraints, we do not address this issue in the paper. However, in a longer version (to be posted on arXiv shortly), we prove that our heuristic is suitable when Assumption 4.1 is satisfied. We will add a comment to that effect in the camera-ready version if the paper is accepted.
Rev #4 raises questions about our simulation settings. On the first question, regarding the choice of negative $\theta_{jk}$'s: this choice ensures that Assumption 4.1 is satisfied. Negative $\theta_{jk}$'s do make the problem easier, since Assumption 4.1 is then guaranteed to hold; however, there are also settings in which some $\theta_{jk}$'s are positive and Assumption 4.1 is still satisfied.
On the second question, regarding the $p = 10$ and $d = 3$ case: our algorithm works in some high-dimensional settings, but it needs a large sample size, especially when $d$ is large. As we mention, we need approximately $n = \Omega(\log p^{6d})$, which is extremely large when $d = 3$. This problem applies not only to our algorithm but also to other DAG-learning algorithms, such as the PC and MMHC algorithms, in which the edges are determined by conditional-independence tests. We will make both of these points more explicit in the camera-ready version if the paper is accepted.
We would also like to thank Reviewers #5-7 for their positive reviews and feedback.

