NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1646 Unsupervised Learning from Noisy Networks with Applications to Hi-C Data

### Reviewer 1

#### Summary

The authors describe an optimization approach to identify clusters (communities) in a network while combining multiple noisy versions of the network. In addition, their framework allows high-confidence evidence to be incorporated (though I’m not convinced that this part of their method is very effective). The authors apply their method to combine multiple resolutions of Hi-C data, and they analyze their approach in a simplified setting of disjoint blocks and in the presence of increasing amounts of noise.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 2

#### Summary

This work focuses on the analysis of Hi-C data (devoted to DNA sequence topology measures). To do that, it proposes an optimisation with constraints (or penalised add-ons) framework to reconstruct a network (OPT1 scheme). It is then modified to include multi-resolution data (OPT2) and then to include high-confidence edges in a semi-supervised framework (OPT6). The method is applied to data for which a Gold Standard is available.

#### Qualitative Assessment

Interesting method tailored to a specific data set and clearly motivated, although it can certainly be applied in many different scenarios. Since no theoretical results are available, a simulation study comparing against the state of the art (not only against the same method without refinements) might be useful. After all, your method is more of an add-on, in the form of a penalty term, to an existing optimisation scheme, even if this is clearly stated.

Some questions/remarks:

- l44: aren't links from regulatory elements to targets? Please explain which components are considered sources/targets.
- end of 1.1/beg. 1.2: it is a bit disturbing that you motivate the limitations of existing work (the concept is not even that clear) in the 'Our Contribution' section.
- l87-88: maybe cast the ability to include high-confidence edges in terms of the semi-supervised literature?
- l95-96: compared to what?
- l99: better explain what the $n$ elements in the network are, i.e. what the nodes represent.
- l104: could you please explain what it means that the network of interest $S$ is low-rank?
- l111: important question here. Projecting the current solution $S$ onto the space of symmetric matrices is probably sub-optimal. Can you at least comment on potential pitfalls/solutions? (e.g. see Hiriart-Urruty and Lemaréchal, Convex analysis and minimization algorithms, 1993)
- The OPT2 optimisation scheme makes me think of multiple network inference (from multiple data sets) as in Chiquet et al., Statistics and Computing 2011. Some comments?
- The entropy $P(\alpha)$ seems to favour using one source (depending on the value of $\gamma$). Comments?
- l190: not sure the number of communities $C$ has been introduced beforehand.
- l194: not clear, compared to what?
- l216: so there are no theoretical guarantees that the algorithm converges, right?
- l259: computing time comparison?
- End of conclusions: future work?
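To make the remark on the entropy term concrete, here is a minimal sketch. It assumes (this is not confirmed by the paper) that the weight update maximises $\sum_i \alpha_i c_i - \gamma \sum_i \alpha_i \log \alpha_i$ over the probability simplex, where $c_i$ is a per-source agreement score. Under that assumption the optimum is a softmax of $c_i / \gamma$. The function name `entropy_weights` and the scores are hypothetical.

```python
import numpy as np

def entropy_weights(scores, gamma):
    """Closed-form maximiser of sum(a * scores) - gamma * sum(a * log(a))
    over the probability simplex (assumed form, not the paper's exact one)."""
    z = np.asarray(scores, dtype=float) / gamma
    z = z - z.max()          # shift for numerical stability; softmax is unchanged
    w = np.exp(z)
    return w / w.sum()

# Hypothetical agreement scores for three sources.
scores = [1.0, 1.2, 0.8]
sharp = entropy_weights(scores, gamma=0.01)   # small gamma: nearly one-hot
flat = entropy_weights(scores, gamma=100.0)   # large gamma: nearly uniform
```

With these assumptions, a small $\gamma$ puts almost all weight on the best-scoring source, while a large $\gamma$ spreads it almost evenly, which is the dependence on $\gamma$ the question asks the authors to discuss.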

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

This paper presents a machine learning method to infer an interaction network and its community structure. The novelty of this method is that it is able to combine multiple noisy inputs at different resolutions and learn the underlying denoised network. The authors apply the method to denoising chromosome capture experiment data and calling chromosome contact domains from it. They demonstrate the potential application of this method in identifying genomic loci clusters with biological significance.

#### Qualitative Assessment

Overall, the article is well written and the intuition behind the method is well explained. However, the authors should address the following questions in the paper:

1) Bulk chromosome capture experiment data are typically averages over the chromosome structures of multiple cell populations. One should expect that the community structure of the genomic loci interaction network is also a population average. This poses a challenge in partitioning the nodes into the correct communities. For example, both Sij and Sik can be arbitrarily large, e.g., much larger than the average Savg, while Sjk is arbitrarily small, which means the Euclidean metric proposed in the low-dimensional embedding term L(S,F) might not be obeyed.

2) Capture-C experimental data from different samples are not directly comparable because the mapping from read count space to the intrinsic loci interaction space is unknown. A simple linear combination of different data sets doesn't always make sense. Some sort of normalization of the input matrices W is necessary, but this is never mentioned in the current version of the paper.

3) Fig. 1(d) shows two abrupt interaction bands in the lower right corner of the matrix between the 3rd and 4th mega-domains; it reads as if a set of loci from one domain interact strongly with a small cluster of loci from another domain but do not interact at all with the loci up- and downstream of this cluster. One would expect the interaction intensity to die off somewhat gradually as a function of genomic distance. This abrupt signal doesn't show up in the original data. Could this be an artifact of the denoising?

4) The authors generated a "ground truth" for benchmarking the performance of the method by "visually" extracting the clusters, but this is very subjective, and it was never mentioned what visualization tools were used to aid this. How do we know, for example, whether the algorithm is assigning boundaries (true positives) that the authors failed to identify?

5) More specific information regarding the performance of the method should be given. For example, judging from Fig. 2(c), a lot of false positives appear in the prediction compared to the "ground truth", so it would be helpful to plot the precision-recall curve as a function of the noise level to better understand the performance in the different settings. Also, since the "ground truth" identification is subjective, it might be more informative to compare the performance of this method to other methods in the literature. An alternative comparison would be to examine the overlap of the identified boundaries between different noise levels to see whether the method predicts a consistent set of domains under different noise levels. Another issue is that neither the noise model used in the benchmarking nor how we should expect the identified domain boundaries to change was mentioned.

6) It might be helpful to compare the computational expense of this method in various settings. For example, how much more expensive would it be to combine all the possible inputs as compared to just a few?
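As an illustration of the precision-recall evaluation requested in point 5, predicted domain boundaries could be scored against an annotated set as sketched below. The function name, the tolerance parameter `tol`, and the example positions are hypothetical and not taken from the paper. A predicted boundary counts as correct if it lies within `tol` bins of any annotated boundary.

```python
def boundary_precision_recall(predicted, truth, tol=0):
    """Precision and recall of predicted boundaries, where a match is any
    pair of positions at most `tol` bins apart (a simple, assumed criterion)."""
    # Predicted boundaries that lie near some annotated boundary.
    hits_pred = sum(any(abs(p - t) <= tol for t in truth) for p in predicted)
    # Annotated boundaries that are recovered by some prediction.
    hits_true = sum(any(abs(p - t) <= tol for p in predicted) for t in truth)
    precision = hits_pred / len(predicted) if predicted else 0.0
    recall = hits_true / len(truth) if truth else 0.0
    return precision, recall

# Made-up bin indices, for illustration only.
truth = [10, 50, 90]
predicted = [11, 30, 52, 70]
precision, recall = boundary_precision_recall(predicted, truth, tol=2)
```

Repeating this calculation at each noise level, and for each competing method, would give the curves asked for above.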

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

The manuscript presents a method for community detection in noisy network data derived from Hi-C experiments. The method integrates multi-resolution interaction networks and high-confidence partial labels from Capture-C data.

#### Qualitative Assessment

Firstly, there are insufficient comparisons against existing approaches, given that community detection is a relatively mature field. Secondly, the experiments do not support the generality of the method. The authors tested their method on only one dataset, and it is unclear how representative that dataset is. Overall, I don't think the present manuscript provides enough evidence to be considered for publication.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Summary

In this paper, the authors present an unsupervised learning framework that denoises complex networks. The framework consists of several optimization problems, introduced incrementally from the simplest case to the most complex. More concretely, the authors introduce modifications to handle multi-resolution networks and partial labels. An explanation of how the optimizations are implemented is provided in every case. The authors empirically validate their method by conducting community detection experiments on Hi-C and Capture-C data from chromosomes.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 6

#### Summary

The contribution of this paper is to develop a method which integrates experimental data from multiple resolution chromatin conformation maps to result in a better quality interaction network. The method first compiles information from several data sources into a single incidence matrix on which additional methods for denoising could then be applied.

#### Qualitative Assessment

To my understanding, the authors develop a method for combining interaction data from multiple datasets that differ in noise level and resolution. The method alternates between optimizing S, F, and alpha conditioned on W1-Wm, where S and W1-Wm are the incidence matrices. The goal is to maximize the Frobenius product of S and W [to enforce consistency of the denoised network S with the several datasets W1-Wm (each with a learned weighting parameter alpha_1-alpha_m)], in conjunction with regularization through an auxiliary matrix F [to enforce that S is low rank]. The authors test the method on multi-scale chromosome conformation data from a human lymphoblastoid cell line that has been visually annotated to recover ground-truth clusters; they then introduce additional noise, and the method shows good performance. A more thorough analysis of the expected noise properties and of the ground truth for the chromatin conformation data would help in judging how well the method will generalize to other experimental datasets.
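For readers unfamiliar with the Frobenius product mentioned above, the following sketch shows how a weighted consistency term between S and the inputs W1-Wm could be evaluated. It illustrates only the standard definition $\langle S, W \rangle = \sum_{jk} S_{jk} W_{jk}$. The helper names and the toy matrices are hypothetical, and the paper's full objective, including the regularization through F, is not reproduced here.

```python
import numpy as np

def frobenius_product(A, B):
    """Frobenius inner product: sum of elementwise products, equal to trace(A.T @ B)."""
    return float(np.sum(A * B))

def consistency(S, Ws, alpha):
    """Weighted agreement between a candidate network S and the inputs Ws."""
    return sum(a * frobenius_product(S, W) for a, W in zip(alpha, Ws))

# Toy 2x2 symmetric matrices, for illustration only.
S = np.array([[0.0, 1.0], [1.0, 0.0]])
W1 = np.array([[0.0, 2.0], [2.0, 0.0]])
W2 = np.array([[0.0, 4.0], [4.0, 0.0]])
score = consistency(S, [W1, W2], alpha=[0.25, 0.75])
```

In an alternating scheme of the kind described, a term like `score` would be re-evaluated each time S or the weights change, with the other variables held fixed.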

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)