
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper provides exact p-values for evaluating the significance of the output of some score-based biclustering algorithms. The main result is obtained by noting that, conditioned on the selection event, the data matrix $X$ is a constrained Gaussian random variable, followed by a straightforward application of an existing theorem from [10], which limits the paper's novelty. Nevertheless, the paper provides detailed results for score-based biclustering algorithms, which are relevant and comprehensive.
The paper also characterizes the asymptotic power of the studied algorithms and extends the results to the more general setting of multiple embedded submatrices.
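For context, the pivot from [10] referred to in the summary is the CDF of a normal random variable truncated to an interval $[a, b]$ determined by the selection event. A minimal illustrative sketch (function and variable names are ours, assuming a known variance):

```python
import numpy as np
from scipy.stats import norm

def truncated_gaussian_pivot(x, mu, sigma, a, b):
    """F(x; mu, sigma, a, b): CDF of N(mu, sigma^2) truncated to [a, b].

    Conditional on the selection event, the selected statistic follows
    this truncated law, so F evaluated at the observed value is
    Uniform(0, 1) under the null; an exact selective p-value is then
    1 - F(observed)."""
    num = norm.cdf((x - mu) / sigma) - norm.cdf((a - mu) / sigma)
    den = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
    return num / den
```

By symmetry, a null statistic observed at the centre of a symmetric truncation interval yields a pivot value of 0.5, and the pivot increases monotonically from 0 at $a$ to 1 at $b$.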
Here are some minor suggestions/corrections:
- In the related work (page 2, line 101), it would be interesting to provide a brief discussion regarding spectral-based algorithms as well.
- Theorem 2.1 from [10] needs to be rewritten. The polyhedral set is wrongly defined, and the definition of $\alpha$ seems erroneous (see pages 9-10 of [10]). Please also define the standard normal CDF $\Phi$ and provide context for the truncating parameters $a$ and $b$. Finally, $I_0$ and $J_0$ are not introduced in line 186 of p. 4.
- In the simulation results section (page 8, lines 412 and 420), results are provided for three settings but presented for only two algorithms. It is also not clear how the intelligent initialization outperforms the random initialization for n = 500 or n = 1000 in the k = log n case; a better presentation of the graphs may help.
Q2: Please summarize your review in 1-2 sentences
Interesting paper; while the novelty is limited, the analysis is quite interesting.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors provide a framework for constructing p-values for the output of a variety of "selection" tasks, focusing on biclusters in particular. Their approach is to characterize the distribution of test statistics as multivariate normal random variables constrained to linear polytopes. This framework allows them to construct exact tests (assuming normality) for subsets of variables that have been selected in different ways.
Quality: This is a quality paper. The contribution is nontrivial, and the material is presented in a way that goes beyond solving "just the bicluster significance problem." (Which is important in and of itself.)
Clarity: Good, but I have suggestions for improvement; see below.
Originality: I believe that the specialization of the theorem from reference [10] to biclustering and scan statistics, and the rates that go with it, are novel.
Significance: Addressing statistical significance issues in this domain is very important and widely applicable.
Specific comments:
028: "significant" compared to the rest of X.
063: "...statistically significant bicluster."
088: "We ask whether a submatrix selected by a biclustering algorithm captures the hidden association(s)." I think "captures some of the hidden association(s)" would be more accurate.
The inference is that there is *something* in the computed bicluster that has some signal. In particular, supposing that rows correspond to genes, the specified alpha level does not control false discovery of genes within the computed bicluster; it controls false discovery of an "empty" bicluster. (Similarly, "power" does not indicate the probability of having found the entire bicluster, but only a part of it.) I think it is extremely important to make this clear. If on the other hand I am mistaken about this, I think other readers would be as well.
039: Z ~ N(...) I have to ask the cliched question: what about non-normal noise models? How sensitive is the procedure to different kinds of non-normality? Depends on k, maybe?
071: "...for the amount of the signal in..." I suggest explicitly defining the selected parameter, maybe before (1.2), so that you can state this formally. I also think it would be worth reiterating that the procedure will trap the *selected* parameter with probability (1-\alpha), where the procedure is to run the biclustering algorithm, select the parameter of interest, and then compute/invert the p-value.
077: "In the supplementary materials, we discuss the problem in the more general setting where there are multiple emnbedded [SIC, fix this] submatrices..." I recommend finding a way to lift this discussion into the main paper, even if you can't find room to give the full details. Not saying that it needs to go here exactly, but somewhere in the main paper.
129: Use either the e^TXe form or the trace form, but choose one and use it consistently.
150: "The amount of signal in..." See comment above.
162: "Let eta ... a linear function of y" The "linear function" comment confused me; just let it be a vector.
168: I assume V0 is relevant in other contexts; if it's not relevant for this work perhaps omit?
162-174: There is a lot of notation overloading here that is made worse by the fact that vectors are typeset indistinguishably from scalars. In particular, see if there is a way to remove overloading in the arguments of F (2.7).
378-397: These plots are difficult to distinguish, and I'm not sure what the intended message is. If it is to compare the power of random initialization against intelligent initialization, then they should appear on the same plot. Then have different plots for (e.g.) 3 different n.
Q2: Please summarize your review in 1-2 sentences
A novel, exact hypothesis testing framework for biclusters. Makes a solid contribution.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
First, we wish to thank the reviewers for their comments and suggestions. The two main concerns of the reviewers seem to be the novelty and significance of the paper.
1) Although the proposed pivot is a special case of a more general pivot, most known applications of the established result are to inference after model selection in regression. By casting the problem of inference with scan statistics as a selective inference problem, we are able to form valid p-values for any Gaussian scan statistic. Previously, the problem has only been approached asymptotically (even in the Gaussian case). The standard theorems in previous work state: "For a given scan statistic and conditions on the signal strength, P(type 1 error) + P(type 2 error) -> 0." These results are generally too coarse (since they combine type 1 and type 2 errors and rely on concentration results instead of distributional results) to understand the distribution of the scan statistic, and cannot be used to compute a p-value. Furthermore, these results do not apply to ad-hoc scan statistics that are frequently used, such as the greedy algorithm. We believe that providing tests with provable type 1 error control for ad-hoc procedures is a major contribution of this work. In other words, this work is concerned with testing the output of a submatrix/biclustering algorithm. As is standard in the hypothesis testing setup, we always ensure that the test controls type 1 error, and under certain assumptions on the signal strength we show that the type 2 error tends to 0.
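For concreteness, the conditioning step underlying these p-values can be sketched for a polyhedral selection event {Gamma y <= u}: conditional on selection and on the component of y orthogonal to the test direction eta, the statistic eta'y is a Gaussian truncated to an interval, which plays the role of the truncating parameters a and b. This is an illustrative numpy sketch in the style of the polyhedral lemma of Lee et al. (2013), assuming isotropic Gaussian noise (names are ours, not the paper's):

```python
import numpy as np

def truncation_interval(y, Gamma, u, eta):
    """Truncation bounds for eta'y given the selection {Gamma y <= u}.

    Assumes y ~ N(mu, sigma^2 I); the isotropic variance cancels out
    of the direction c below. Constraints with (Gamma c)_j = 0 do not
    restrict eta'y and are excluded by the sign masks."""
    c = eta / (eta @ eta)                      # direction carrying eta'y
    Gc = Gamma @ c
    with np.errstate(divide="ignore"):
        # candidate interval endpoint contributed by each constraint
        rho = eta @ y + (u - Gamma @ y) / Gc
    V_minus = np.max(rho[Gc < 0]) if np.any(Gc < 0) else -np.inf
    V_plus = np.min(rho[Gc > 0]) if np.any(Gc > 0) else np.inf
    return V_minus, V_plus
```

For example, with the one-dimensional selection event {y >= 0, y <= 5} and observed y = 2, the statistic eta'y = y is truncated to [0, 5]; evaluating the truncated-Gaussian CDF at the observed value on that interval gives the exact selective p-value.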
2) To our knowledge, there are no results in the literature on the power of tests based on the general framework of Lee et al. (2013). We provide two different power results. The first is for the combinatorial procedure of Theorem 2.5. This result shows that the selective test results in no loss of power, since the signal conditions match those in Balakrishnan et al. The second power theorem applies to the largest row sum/column sum test in Theorem 4.1. This test is computationally tractable and attains the optimal signal conditions among computationally tractable procedures, assuming the planted clique conjecture (Ma and Wu 2013). This can be seen by letting \mu = constant, which corresponds to k = sqrt(n log n). Theorem 4.1 shows that the power tends to 1 in this regime. This is optimal up to the sqrt(log n) factor, since assuming planted clique there is no computationally tractable procedure for detecting in the regime \mu = constant and k = o(sqrt(n)).
Reviewer 1: 1. Although we don't study the power of the greedy algorithm, we study the power of a computationally feasible algorithm in Theorem 4.1. 2. Yes, there is a typo in the definition of C.
Reviewer 2: 1. We will add a discussion of spectral algorithms. 2. Yes, Theorem 2.1 has a typo. We will clarify $\Phi$, $I_0$, and $J_0$. 3. The reviewer is correct that when k = log n, the intelligent initialization is not helping. As we can see from Theorem 4.1, the row/column test requires that k ~ sqrt(n) when \mu is a constant. Thus in the 3 plots of k = sqrt(n) and k = .2n, we see that the intelligent initialization is helping.
Reviewer 3: 1. We will clarify that a false discovery is an empty bicluster, and does not refer to the proportion of genes within the computed bicluster. 2. Under certain assumptions, the approach also gives asymptotically valid inference when the noise is non-Gaussian. Theoretical justification for the appeal to asymptotics is described in Tian and Taylor (2015). 3. We will incorporate your suggestions to make the writing clearer.
X. Tian and J.E. Taylor, Asymptotics of selective inference, arXiv preprint arXiv:1501.03588.
Reviewer 5: We will try to redo the exposition in Section 2 to help give context and motivation.
Reviewer 6: Thank you for your comments. We agree that the importance of this work is to complement existing work that only tests for biclusters without finding them, or only provides sufficient conditions for correct estimation. Our work allows for testing what an algorithm finds.

