{"title": "Unsupervised Learning from Noisy Networks with Applications to Hi-C Data", "book": "Advances in Neural Information Processing Systems", "page_first": 3305, "page_last": 3313, "abstract": "Complex networks play an important role in a plethora of disciplines in natural sciences. Cleaning up noisy observed networks poses an important challenge in network analysis. Existing methods utilize labeled data to alleviate the noise effect in the network. However, labeled data is usually expensive to collect while unlabeled data can be gathered cheaply. In this paper, we propose an optimization framework to mine useful structures from noisy networks in an unsupervised manner. The key feature of our optimization framework is its ability to utilize local structures as well as global patterns in the network. We extend our method to incorporate multi-resolution networks in order to add further resistance to high levels of noise. We also generalize our framework to utilize partial labels to enhance the performance. We specifically focus our method on multi-resolution Hi-C data by recovering clusters of genomic regions that co-localize in 3D space. Additionally, we use Capture-C-generated partial labels to further denoise the Hi-C network. We empirically demonstrate the effectiveness of our framework in denoising the network and improving community detection results.", "full_text": "Unsupervised Learning from Noisy Networks with\n\nApplications to Hi-C Data\n\nBo Wang*1, Junjie Zhu2, Oana Ursu3, Armin Pourshafeie4, Serafim Batzoglou1 and Anshul Kundaje3,1\n\n1Department of Computer Science, Stanford University\n\n2Department of Electrical Engineering, Stanford University\n\n3Department of Genetics, Stanford University\n4Department of Physics, Stanford University\n\nAbstract\n\nComplex networks play an important role in a plethora of disciplines in natural\nsciences. Cleaning up noisy observed networks poses an important challenge in\nnetwork analysis. 
Existing methods utilize labeled data to alleviate the\nnoise levels. However, labeled data is usually expensive to collect while unlabeled\ndata can be gathered cheaply. In this paper, we propose an optimization framework\nto mine useful structures from noisy networks in an unsupervised manner. The key\nfeature of our optimization framework is its ability to utilize local structures as\nwell as global patterns in the network. We extend our method to incorporate\nmulti-resolution networks in order to add further resistance in the presence of high levels\nof noise. The framework is generalized to utilize partial labels in order to further\nenhance the performance. We empirically test the effectiveness of our method in\ndenoising a network by demonstrating an improvement in community detection\nresults on multi-resolution Hi-C data both with and without Capture-C-generated\npartial labels.\n\n1 Introduction\nComplex networks emerge in a plethora of disciplines including computer science, social sciences,\nbiology, etc. They entail non-trivial topological features and patterns critical to understanding\ninteractions within complicated systems. However, observed networks from data are typically noisy\ndue to imperfect measurements. The adverse effects of noise pose a critical challenge in unraveling\nclear structures and dynamics in the networks. Therefore, network denoising can strongly influence\nhow the networks are interpreted, and can significantly improve the outcome of downstream analysis\nsuch as global and local community detection.\nThe goal of community detection is to identify meaningful structures/communities underlying the\nprovided samples in an unsupervised manner. While the performance of community detection\nalgorithms can worsen due to noise [1], one may use prior knowledge about the structure of the\ncommunities, such as the presence of clusters, to recover local networks [25]. 
In addition to the\nspecial structure that one may expect in a network, a small portion of high-confidence links may be\navailable. The combination of the special structure and the confident links can be used to denoise\na network that may include both noisy and missing links. How to incorporate multiple sources\nof information to construct a network has been widely studied in the context of data fusion or data\naggregation [3].\nBiology offers a special case where the overall structure of the network of interest might be known\nfrom the science but the data may be riddled with noise. One example of this is the 3D structure, or\nfolding, of DNA. In biology, this structure is important as, among other things, the DNA topology\n\n*bowang87@stanford.edu\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fhas been shown to have a fundamental impact on gene expression in biological processes [4]. For\nexample, many genes are in 3D contact with genomic regions that are far away in the linear genome\nbut close in 3D space. These genomic regions contain regulatory elements that control when the\ngene is active [5, 6, 34]. The rules by which regulatory elements come in contact with their target\ngenes are still unclear [7]. While the exact mechanism for this selective interaction between the\nregulatory elements and the target genes is unknown, the 3D organization of the genome in domains\nof interaction seems to play a crucial role. Furthermore, topologically associated domains (TADs) [8],\nwhere local interactions are observed, are of potential biological interest as they have been shown to\nbe conserved between mice and humans, which suggests an ancient root for higher order structures in\nthe genome [8].\nThe interaction map of regulatory elements in a genome can be viewed as a network, where each\nnode is a regulatory element and each link represents the interaction strength between two of the\nelements. 
In this context, we not only have prior knowledge about the types of structures in this\ninteraction network, but we also have various types of noisy and incomplete observations of the links\nin this network based on recently-developed technologies.\nOne approach to observe these links is Hi-C, an experiment which uses high-throughput sequencing to\nconstruct a 2D contact map measuring the frequency with which pairs of genomic regions co-localize\nin 3D space. The results of Hi-C experiments can be summarized at multiple resolutions (lower\nresolutions are obtained by binning genomic regions together), ranging from 1 kb to 1 Mb [8–10].\nHigher resolution maps capture more fine-grained location interactions but the background noise\ngenerated by random collisions can be larger [11]. Lower resolution maps have less noise, but at\nthe cost of losing the exact localization of genomic contacts. In addition to Hi-C, there are other\nexperiment variants such as 4C, 5C and Capture-C technologies, which provide a new window into\ndetection of a small number of interaction links with high confidence [12–15], by focusing sequencing\nresources on a selected subset of contacts. For these experiments the increased confidence comes at\nthe cost of not measuring the full contact map. Thus, the integration of multi-resolution noisy Hi-C\ndata with high-confidence Capture-C data is not only an interesting problem in the context of general\nnetwork denoising but it is also biologically relevant.\n\n1.1 Related Work\nMany applications in biology utilize multiple measurements to construct a single biological network.\nGeneral approaches such as [16] have relied on specific models to reveal structures from the multiple\nmeasurements. However, some biological networks do not fit these model assumptions (e.g. 
Gaussian).\nFor example, while Hi-C data can be summarized at multiple resolutions, standard model assumptions\nare not appropriate for combining the resolutions.\nFurthermore, one may acquire a small subset of highly confident measurements. In the case of Hi-C\ndata, this can be done through Capture-C [12, 13, 15] technologies. While matrix completion is a\nwell studied problem [2] to recover missing measurements, the setting with Capture-C is slightly\ndifferent. In particular, the number of highly confident entries for the n × n adjacency matrix of\nrank r may be less than nr log n, which is suggested for matrix completion [2]. Additionally, such\na method would not take advantage of the large amount of data available, albeit with higher noise,\nfrom different sources.\nA common application of denoised networks is to more reliably detect biologically-relevant\ncommunities. General community detection methods have been used to find protein complexes\n[17, 18], genetically related subpopulations [19], like-minded individuals in a social network [20],\nand many other tasks [21, 22]. Aside from the all-purpose algorithms mentioned above, there are\nspecialized algorithms for the problem of community detection in Hi-C data. Rao et al. define TADs\nusing the specialized Arrowhead algorithm [10]. Cabreros et al. used a mixed-membership stochastic\nblock model to discover communities in Hi-C data [23]. Their method can detect the number of\ncommunities or can be forced to find a specified number of communities. Dixon et al. defined a\ndirectionality index that quantifies the asymmetry between the upstream and downstream interaction\nbias for a position [8]. A hidden Markov model was subsequently used to detect biased states based\non these scores [8].\n\n1.2 Our Contribution\nAs mentioned above, Hi-C data can be represented by many different resolutions. 
Although the\ndata and noise from these resolutions are not independent, the different resolutions still contain\n\n2\n\n\finformation that can help denoise the data. We propose a model-free optimization framework to\nextract information from the different resolutions to denoise the Hi-C data. While generic community\ndetection methods are limited to using only a single input network, our optimization framework is\nable to pool data from different resolutions, and produces a single denoised network. This framework\nallows us to apply community detection methods to multi-resolution Hi-C data.\nFurthermore, in special cases, a subset of the interaction network may be known with high confidence\nusing Capture-C [12]. To our knowledge, there is no algorithm with the capability of taking advantage\nof this highly confident set to improve the denoising of Hi-C data. Our framework is able to take\na multi-resolution network in addition to the confident set of data to denoise the corresponding\nnetwork. Applying our framework to datasets with simulated ground-truth communities derived from\nchromosomes 14 and 21 of GM12878 in [10], we find that our framework can indeed leverage the\nmultiple sources of information to reveal the communities underlying the noisy and missing data.\n2 Problem Setup\n2.1 General Learning Framework\nThroughout this paper, we will use a real and symmetric n × n matrix to represent a network on n\nnodes. Accordingly, the (i, j)th entry of the matrix will be used to denote the weight or intensity of a\nlink between node i and node j.\nSuppose we want to construct a weighted network S ∈ R^{n×n} from a noisy observation W ∈ R^{n×n}\non the same nodes, where the noise introduces false-positive and false-negative links. If the network\nof interest S is low rank, then this inherent structure can be used to denoise W. 
This intuition that the\ndetected noisy matrix could lie near an underlying low-rank or sparse matrix is also key to subspace\ndetection algorithms, such as Sparse Subspace Clustering [25] and Low-Rank Representation [26].\nWe use this intuition to formulate our optimization framework below:\n\nminimize −tr(W^T S) + λ L(S, F) + γ ||S||_F^2\nwith respect to S ∈ R^{n×n}, F ∈ R^{n×C}\nsubject to F^T F = I_C, Σ_j S_ij = 1, S_ij ≥ 0 for all (i, j),\nwhere L(S, F) = tr(F^T (I_n − S) F),\n\n(OPT1)\n\nhere λ, γ > 0 are tuning parameters (see Appendix 7.3). F is an auxiliary C-dimensional variable\n(with C < n) and is constrained to consist of orthogonal columns. S is constrained to be a stochastic\nmatrix and further regularized by the squared Frobenius norm, i.e. ||S||_F^2.\nIn order to represent the resulting denoised network, the solution S can be made symmetric by\n(S + S^T)/2. In addition, the objective and constraints in (OPT1) ensure two key properties for S to\nrepresent a denoised network:\nProperty (1): S complies well with the links in network W.\nThe first term in the objective function of (OPT1) involves maximizing the Frobenius product of S\nand W, i.e.,\n\ntr(W^T S) = Σ_{i,j} W_ij S_ij,\n\nso each link in S is consistent with W. Taking the sum of the element-wise products allows S to be\ninvariant to scaling of W.\nProperty (2): S is low rank and conveys cluster structures.\nThe term L(S, F) in (OPT1) is an imposed graph regularization on S so that it is embedded in a\nlow-dimensional space spanned by F. To see this, first note that (I_n − S) is the graph Laplacian of S\nas the row sums (and column sums) of S are 1. It can be shown that\n\nL(S, F) = tr(F^T (I_n − S) F) = Σ_{i,j} ||f_i − f_j||_2^2 S_ij,\n\nwhere f_i and f_j are the ith and jth rows of F respectively. Thus, each row of F can be interpreted\nas a C-dimensional embedding of the corresponding node in the network. 
Here, || · ||_2 denotes the\nℓ2-norm, so the minimization of L(S, F) enforces link S_ij to capture the Euclidean distance of node\ni and node j in the vector space spanned by F.\n\n3\n\n\f2.2 Learning from multi-resolution networks\nThe general graph denoising framework above can be easily extended to incorporate additional\ninformation. Suppose instead of a single observation W, we have m noisy observations or\nrepresentations of the underlying network S. Denote these observations as W_1, ..., W_m. We refer to this\nmulti-resolution network as W, where each link in W contains m different ordered values. (This\nterminology is not only used to conveniently correspond to the Hi-C interaction maps at different\nresolutions, but it also helps to remind us that the noise in each network is not necessarily identical\nor stochastic.) A multi-resolution network consists of different representations of S and provides\nmore modeling power than a single-resolution network [32]. We can use this additional information\nto extend (OPT1) to the following optimization problem:\n\nminimize −tr((Σ_ℓ α_ℓ W_ℓ)^T S) + λ L(S, F) + γ ||S||_F^2 + ρ P(α)\nwith respect to S ∈ R^{n×n}, F ∈ R^{n×C}, α ∈ R^m\nsubject to F^T F = I_C, Σ_j S_ij = 1, Σ_ℓ α_ℓ = 1, S_ij ≥ 0 for all (i, j), α_ℓ ≥ 0 for all ℓ,\nwhere L(S, F) = tr(F^T (I_n − S) F), P(α) = Σ_ℓ α_ℓ log α_ℓ,\n\n(OPT2)\n\nwhere λ, γ, ρ > 0 are tuning parameters (see Appendix 7.3). The vector α = [α_1, ..., α_m]^T weights\nthe m observed networks W_1, ..., W_m and needs to be learned from the data.\nThe modification of the first term in the objective of (OPT1) to that in (OPT2) allows S to\nsimultaneously conform with all of the networks according to their importance. To avoid overfitting\nwith the weights or selecting a single noisy network, we regularize α via P(α) in the objective of\n(OPT2). 
In our application, we chose P(α) so that the entropy of α is high, but one may select other\npenalties for P(α) (e.g., L1 or L2 penalties).\nWhile (OPT2) is non-convex with respect to all three variables S, F, α, the problem is convex with\nrespect to each variable conditional on fixing the other variables. Therefore, we apply an alternating\nconvex optimization method to solve this tri-convex problem efficiently. The three optimization\nproblems are solved iteratively until all the solutions converge. The following explains how each\nvariable is initialized and updated.\n\n(1) Initialization.\n\nThe variables S, F and α are initialized as\n\nα^(0) = (1/m) 1_m, S^(0) = Σ_ℓ α^(0)_ℓ W_ℓ, F^(0) = [v^(0)_1, ..., v^(0)_C]\n\nwhere 1_m is a length-m vector of ones, i.e., 1_m = [1, ..., 1]^T. The weight vector α is set to be\na uniform vector to avoid bias, and S is initialized to be the sum of the individual observed\nnetworks W_ℓ according to the initial weights. Finally, F is initialized to be the top C eigenvectors\nof S, denoted as v^(0)_1, ..., v^(0)_C.\n\n(2) Updating S with fixed F and α.\n\nWhen we minimize the objective function only with respect to the similarity matrix S in (OPT2),\nwe can solve the equivalent problem:\n\nminimize −Σ_{i,j} (Σ_ℓ α_ℓ (W_ℓ)_{i,j} + λ (F F^T)_{i,j}) S_{i,j} + γ Σ_{i,j} S_{i,j}^2\nwith respect to S ∈ R^{n×n}\nsubject to Σ_j S_ij = 1, S_ij ≥ 0 for all (i, j).\n\n(OPT3)\n\nThis optimization problem is clearly convex because the objective is quadratic in S_{i,j} and the\nconstraints are all linear. We used the KKT conditions to solve for the updates of S. Details of\nthe solution are provided in Appendix 7.1.\n\n4\n\n\f(3) Updating F with fixed S and α. 
When we minimize the objective function only with respect to\nthe matrix F in (OPT2), we can solve the equivalent problem:\n\nminimize tr(F^T (I_n − S) F)\nwith respect to F ∈ R^{n×C}\nsubject to F^T F = I_C.\n\n(OPT4)\n\nThis optimization problem can also be interpreted as solving the eigenvalue problem for (S − I_n)\nbecause the trace of F^T (S − I_n) F is maximized when F is a set of orthogonal bases of the\neigen-space associated with the C largest eigenvalues of (S − I_n). We used standard numerical\ntoolboxes in MATLAB to solve for the eigenvectors.\n\n(4) Updating α with fixed F and S.\n\nNow treating S and F as parameters, the equivalent problem with respect to α becomes a simple\nconvex optimization problem:\n\nminimize −Σ_ℓ α_ℓ Σ_{i,j} (W_ℓ)_{i,j} S_{i,j} + ρ Σ_ℓ α_ℓ log α_ℓ\nwith respect to α ∈ R^m\nsubject to Σ_ℓ α_ℓ = 1, α_ℓ ≥ 0 for all ℓ.\n\n(OPT5)\n\nUsing the optimality conditions, we derived a closed-form solution for α_ℓ for each ℓ:\n\nα_ℓ = exp(Σ_{i,j} (W_ℓ)_{i,j} S_{i,j} / ρ) / Σ_{ℓ'} exp(Σ_{i,j} (W_{ℓ'})_{i,j} S_{i,j} / ρ).\n\nDetails are provided in Appendix 7.2.\n\n(5) Termination.\n\nThe alternating optimization terminates when all three variables S, F, and α converge. Even\nthough alternating optimization techniques are widely-used heuristic approaches, the parameters\nconverged in approximately 20 iterations in the applications we have considered.\n\n2.3 Learning from multi-resolution networks and highly confident links\nNow suppose in addition to a multi-resolution network, we are given noiseless (or highly confident)\ninformation about the presence of certain links in the network. More formally, we are given a set P,\nsuch that if a link (i, j) ∈ P, then we know that it is almost surely a true positive link. 
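Putting steps (1)-(5) together, the alternating scheme can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's solver: the exact S-update uses the KKT conditions of (OPT3) (Appendix 7.1), so a projected-gradient step with a row-wise renormalization stands in for it here, and the function name and step size are our own.

```python
import numpy as np

def denoise_multires(Ws, C, lam=1.0, gamma=1.0, rho=1.0, iters=20, step=0.1):
    """Sketch of the alternating optimization for (OPT2).

    Ws : list of m noisy symmetric (n x n) observations W_1..W_m.
    The S-update below (gradient step + clipping + row normalization)
    is a stand-in for the exact KKT-based update of Appendix 7.1.
    """
    m = len(Ws)
    alpha = np.full(m, 1.0 / m)                   # step (1): uniform weights
    S = sum(a * W for a, W in zip(alpha, Ws))     # step (1): weighted sum
    for _ in range(iters):
        # Step (3): F = top-C eigenvectors of (the symmetrized) S.
        _, vecs = np.linalg.eigh((S + S.T) / 2)
        F = vecs[:, -C:]
        # Step (2), simplified: gradient of the (OPT3) objective, then
        # clip to nonnegativity and renormalize rows to sum to one.
        Wbar = sum(a * W for a, W in zip(alpha, Ws))
        grad = -(Wbar + lam * F @ F.T) + 2.0 * gamma * S
        S = np.maximum(S - step * grad, 0.0)
        S = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
        # Step (4): closed-form softmax update for the weights alpha.
        scores = np.array([np.sum(W * S) for W in Ws]) / rho
        scores -= scores.max()                    # numerical stability
        alpha = np.exp(scores)
        alpha /= alpha.sum()
    return (S + S.T) / 2, F, alpha
```

The α-update is exactly the closed form of (OPT5): each resolution's weight is a softmax of its agreement Σ_{i,j}(W_ℓ)_{i,j}S_{i,j} with the current denoised network.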
If (i, j) ∉ P,\nthen we only know that this link was unobserved, and have no information whether or not it is present\nor absent in the true denoised network.\nIn the applications we consider, there is typically a subset of nodes for which all of their incident\nlinks are unobserved. So if we consider a binary adjacency matrix on these nodes based on P, a\nnumber of columns (or rows) will indeed have all missing values. Therefore, the only information we\nhave about these nodes are their incident noisy links in the multi-resolution network.\nThe formulation in (OPT2) can easily incorporate the positive set P. For each node i, we denote\nP_i = {j : (i, j) ∈ P} and formulate an extended optimization problem:\n\nminimize −τ f(S) − tr((Σ_ℓ α_ℓ W_ℓ)^T S) + λ L(S, F) + γ ||S||_F^2 + ρ P(α)\nwith respect to S ∈ R^{n×n}, F ∈ R^{n×C}, α ∈ R^m\nsubject to F^T F = I_C, Σ_j S_ij = 1, Σ_ℓ α_ℓ = 1, S_ij ≥ 0 for all (i, j), α_ℓ ≥ 0 for all ℓ,\nwhere f(S) = Σ_{i=1}^n (1/|P_i|) Σ_{j∈P_i} S_ij, and L(S, F) and P(α) follow from (OPT2).\n\n(OPT6)\n\nNotice that when applying alternating optimization to solve this problem, we can simply use the\nsame approach used to solve (OPT2). The only change needed is to include f(S) in the objective of\n(OPT3) in order to update S.\n\n5\n\n\f3 Implementation Details\n3.1 How to Determine C\nWe provide an intuitive way to determine the number of communities, C, in our methods. The\noptimal value of C should be close to the true number of communities in the network. One possible\napproach to discover the number of groups is to analyze the eigenvalues of the weight matrix and\nsearch for a drop in the magnitude of the eigenvalue gaps. However, this approach is very sensitive\nto the noise in the weight matrix and can therefore be unstable in noisy networks. We use an alternative\napproach by analyzing eigenvectors of the network, similar to [27]. 
Consider a network with C\ndisjoint communities. It is well known that the eigenvectors of the network Laplacian form a full\nbasis spanning the network subspace. Although the presence of noise may cause this ideal case to fail, it\ncan still shed light on community membership. Given a specific number of communities C, we aim\nto find an indication matrix Z(R) = XR, where X ∈ R^{n×C} is the matrix of the top eigenvectors of\nthe network Laplacian, and R ∈ R^{C×C} is a rotation matrix. Denote [M(R)]_i = max_j [Z(R)]_{i,j}. We\nsearch for R such that it minimizes the following cost function\n\nJ(R) = Σ_{i,j} [Z(R)]_{i,j}^2 / [M(R)]_i^2.\n\nMinimizing this cost function over all possible rotations will provide the best alignment with the\ncanonical coordinate system. This is done using the gradient descent scheme [27]. Instead of taking\nthe number of communities to be the one providing the minimal cost as in [27], we seek the number\nof communities that result in the largest drop in the value of J(R).\n3.2 Convergence Criterion\nThe proposed method is an iterative algorithm. It is important to determine a convergence criterion\nto stop the iterations. Our method adopts a well-defined approach to decide when convergence has been\nreached. Similar to spectral clustering [3], we use the eigengap to measure the convergence of our\nmethod. The eigengap is defined as follows:\n\neigengap(i) = λ_{i+1} − λ_i\n\n(1)\nwhere λ_i is the i-th eigenvalue of the matrix S and the eigenvalues are sorted in ascending order\n(λ_1 ≤ λ_2 ≤ ... ≤ λ_n). For C clusters, we use eigengap(C) = λ_{C+1} − λ_C.\nThe intuition behind the eigengap is that, if a similarity matrix includes C perfectly strong clusters, then\neigengap(C) should be near zero (which was proved in [28]). Due to the low-rank constraint in our\noptimization framework, we seek a small value of eigengap(C) for a good optimal value. We can set\na stopping criterion for our method using eigengap(C) < T for a small threshold, T. 
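The criterion in Equation (1) is cheap to evaluate. A minimal NumPy sketch (the symmetrization of S before the eigendecomposition is our own safeguard, matching how the denoised network is reported):

```python
import numpy as np

def eigengap(S, C):
    """eigengap(C) = lambda_{C+1} - lambda_C, as in Equation (1), with the
    eigenvalues of (the symmetrized) S sorted in ascending order."""
    vals = np.sort(np.linalg.eigvalsh((S + S.T) / 2))  # ascending order
    return vals[C] - vals[C - 1]  # 1-indexed lambda_{C+1} - lambda_C
```

For a similarity matrix with C perfectly disjoint, uniformly weighted clusters, eigengap(C) computed this way is zero, consistent with the intuition stated above.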
However, due\nto noise, reaching a small threshold cannot be guaranteed; therefore, the practical stopping criterion\nadopted by our method is when eigengap(C) has stopped decreasing. In our experiments we have\nobserved that eigengap(C) usually decreases for around 10 iterations and then remains stable.\n4 Experiments\nWe apply the framework presented in (OPT6) to Hi-C and Capture-C data. As explained, detecting\ncommunities in these data has important scientific ramifications. Our denoising strategy can be part of\nthe pipeline for discovering these communities. We evaluated our methods on real data and checked\ntheir robustness by adding additional noise and measuring performance.\nFor the real data, we started with a ground truth of domains previously identified in the GM12878\ncell line chromosomes 14 and 21 [10], filtered to only contain domains that do not have ambiguous\nboundaries or that overlap due to noise in the ground truth, and stitched these together. We ran our\nalgorithm using data at 8 different resolutions (5 kb, 10 kb, 25 kb, 50 kb, 100 kb, 250 kb, 500 kb, 1 Mb).\nA heat map of the highest and lowest resolution of the simulated data for chromosome 21 can be seen\nin Figure 1.\nFigure 2a shows a heat map of the denoised version of chromosome 14 using (OPT6). Below the\nheat map we show the ground truth blocks. 1) The baseline (Louvain algorithm [29]) was set to\nthe clusters determined from the highest resolution Hi-C map (purple). 2) The clustering improves\nafter denoising this map using (OPT1) (orange). 3) Pooling data through the use of multi-resolution\nmaps and (OPT2) further increases the size of the clusters. Finally, 4) using the high-confidence set\ntogether with the multi-resolution maps and (OPT6) gives the best agreement (blue).\n\n6\n\n\fAs mentioned earlier, in order to determine ground truth, we have chosen large disjoint blocks with\nlow levels of noise. To test our algorithm in the presence of noise, we added distance-dependent\nrandom noise to the network. 
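One way to generate such perturbations is to draw random noise whose magnitude decays with the genomic distance between bins, mimicking the random-collision background of Hi-C. This is a hypothetical noise model for illustration only; the paper does not specify the exact form of its distance-dependent noise, so the function name and the `scale`/`decay` parameters below are our own.

```python
import numpy as np

def add_distance_dependent_noise(W, scale=0.5, decay=0.05, seed=0):
    """Add symmetric random noise whose magnitude decays with |i - j|.

    Hypothetical noise model (illustrative parameters): noise at bin pair
    (i, j) is scale * exp(-decay * |i - j|) * Uniform(0, 1), symmetrized.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    noise = scale * np.exp(-decay * dist) * rng.random((n, n))
    return W + (noise + noise.T) / 2
```

Under this model, bin pairs close on the linear genome receive larger perturbations than distant pairs, which is the regime in which short-range contacts are hardest to distinguish from background.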
We evaluated our performance by measuring the normalized mutual\ninformation (NMI) between the ground truth and the clusters resulting from the noisy data (Figure 2b)\n[30]. We see that while the NMI from the baseline falls rapidly, the performance of our denoising\nalgorithm stays relatively constant after a rapid (but significantly smaller) drop. Figure 2c shows the\nweights assigned to each resolution as noise is added. We see that the weight assigned to the highest\nresolution has a steep drop with a small amount of noise. This could partially explain the drop in the\nperformance of the baseline (which is computed from the high resolution data) in Figure 2b.\nTo validate the denoised network obtained by our method, we check two features of domains: 1)\nincreased covariation in genomic signals such as histone marks inside domains compared to across\ndomains and 2) the binding of the protein CTCF at the boundaries of the domains (see Appendix 7.4).\nWe quantify covariation in genomic signals by focusing on three histone marks (H3K4ME1, H3K4ME3\nand H3K27AC), and computing the correlation of these across all pairs of genomic regions, based\non measurements of these histone marks from 75 individuals [33]. We then compare the ratio\nbetween covariation within domains and covariation between domains. A higher ratio indicates better\ncoherence of biological signals within domains and larger dispersion of signals between domains,\ntherefore implying better quality of the identified domains. Second, we inspect another key biological\nphenomenon: the binding strength of the transcription factor CTCF at boundary regions [31, 35]. It\nhas been observed that CTCF usually binds at the boundaries of domains in Hi-C data. This serves as\nanother way to validate the correctness of identified domain boundaries, by checking the fraction\nof domain boundaries that contain CTCF. 
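The within/between covariation ratio can be computed directly from a pairwise correlation matrix and a domain labeling. A simplified sketch (the paper's exact statistic may aggregate the three histone marks differently; the function below is an illustration of the ratio itself):

```python
import numpy as np

def covariation_ratio(corr, domains):
    """Mean pairwise correlation within domains divided by the mean
    correlation between domains; higher values indicate domains with
    more coherent internal signal (simplified validation statistic).

    corr    : (n x n) correlation matrix of a genomic signal across bins.
    domains : length-n array of domain labels, one per bin.
    """
    domains = np.asarray(domains)
    same = domains[:, None] == domains[None, :]   # same-domain mask
    off_diag = ~np.eye(domains.size, dtype=bool)  # exclude self-pairs
    within = corr[same & off_diag].mean()
    between = corr[~same].mean()
    return within / between
```

A ratio above 1 means bins inside a domain covary more strongly with each other than with bins in other domains, which is the signature the validation looks for.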
Figure 2d shows that our method produces a higher ratio\nof specific histone marks and CTCF binding than the baseline, indicating better ability to detect\nbiologically meaningful boundaries.\nIn each experiment, we selected the number of communities C for clustering based on the\nimplementation details in Section 3. The best C is highlighted in Figure 3a-b. The optimal C coincided with\nthe true number of clusters, indicating that the selection criterion was well-suited for the two datasets.\nFurthermore, as shown in Figure 3c-d the alternating optimization in (OPT6) converged within 20\niterations according to the criteria in Section 3 where the eigen-gaps stabilized quickly.\n\nFigure 1: a) Heat map of data simulated from chromosome 21. The subclusters were chosen to be clearly\ndistinguishable from each other (in order to have clear boundaries to determine the ground truth for the boundaries\nof the blocks). The blocks were subsequently stitched to each other. b) Simulated low resolution data. c) Capture-\nC network: these positions are treated as low-noise data. d) Denoised Network: (OPT6) was used to denoise the\nnetwork using all 8 resolutions in addition to the Capture-C data in (c).\n5 Conclusions and Future Work\nIn this paper we proposed an unsupervised optimization framework to learn meaningful structures\nin a noisy network. We leverage multi-resolution networks to improve the robustness to noise\nby automatically learning weights for different resolutions. In addition, our framework naturally\nextends to incorporate partial labels. We demonstrate the performance of our approach using genomic\ninteraction networks generated by noisy Hi-C data. In particular, we show how incorporating\nmultiple Hi-C resolutions enhances the effectiveness in denoising the interaction networks. 
Given\npartial information from Capture-C data, we further denoise the network and discover more accurate\ncommunity structures.\nIn the future, it would be important to extend our method to whole genome Hi-C data to get a global\nview of the 3D structure of the genome. This will involve clever binning or partition of the genome to\n\n7\n\n\f[Figure 2d: Biology validation — bar chart comparing the baseline and our method on H3K4ME1, H3K4ME3, H3K27AC and CTCF binding.]\n\nFigure 2: a) Denoised Network: heatmap of the denoised network using Hi-C and Capture-C according to\n(OPT6). The tracks below the heatmap indicate the division of the classified communities with respect to the\nground truth. The use of multi-resolution Hi-C and Capture-C achieves the best concordance with the ground\ntruth. b) Clustering performance: The performance of the baseline degrades rapidly with the introduction of\nnoise. Our method with various inputs performs significantly better than the baseline, suggesting that denoising\nusing our framework can significantly improve the task of clustering. c) Weight Distribution: The weights (α_i)\nassigned to each resolution from the optimization in (OPT2). As the noise increases, the performance of the\nhighest-resolution matrix decreases rapidly at first; in response, the method rapidly decreases the weight for this\nmatrix. d) Ratio between covariates: we used three specific histone marks and the CTCF binding sites as indicators of\nthe accuracy in detecting the boundaries.\n\nFigure 3: a) - b) The gradient of J(R) over the number of communities C. The best C selected is based on the\nvalue that minimizes the gradient of J(R) (circled in red). c) - d) The eigen-gaps over the number of iterations\nin the optimization framework. 
The eigengaps stabilize at the same value within 20 iterations, indicating that the\noptimization problem converges in only a few iterations.\n\nreduce the problem size to a more local level where clustering methods can reveal meaningful local\nstructures. In addition, our current framework is very modular. Even though we demonstrate our\napproach with k-means clustering as a module, other appropriate clustering or community detection\nalgorithms can be substituted for this module for whole genome Hi-C analysis. Finally, it would be\ninteresting to extend our approach to a semi-supervised setting where a subset of confident links are\nused to train a classifier for the missing links in Capture-C data.\n6 Acknowledgments\nWe would like to thank Nasa Sinnott-Armstrong for initial advice on this project. JZ acknowledges\nsupport from the Stanford Graduate Fellowship. AP was partially supported by the Stanford Genome\nTraining Program: NIH 5T32HG000044-17. AK was supported by the Alfred Sloan Foundation\nFellowship. OU is supported by the HHMI International Students Research Fellowship. BW and SB\nwere supported by NIH Sidow grant (1R01CA183904-01A1).\n\n8\n\n\fReferences\n[1] J. Yang, J. McAuley, and J. Leskovec. Community detection in networks with node attributes. IEEE International Conference on, pp.\n\n1151-1156. (2013).\n\n[2] E. J. Candes and Y. Plan. Matrix Completion With Noise. Proceedings of the IEEE 98, 925-936. (2010).\n\n[3] B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains and A. Goldenberg. Similarity network fusion for\n\naggregating data types on a genomic scale. Nat. Methods, 11, 333-337. (2014).\n\n[4] J. Dekker. Gene regulation in the third dimension. Science, 319(5871):1793-4. (2008).\n\n[5] L.A. Lettice, et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial\n\npolydactyly. Hum. Mol. Genet., 12, 1725-1735. (2003).\n\n[6] Y. Qin, L.K. Kong, C. Poirier, C. 
Truong, P.A. Overbeek, and C.E. Bishop. Long-range activation of Sox9 in Odd Sex (Ods) mice. Hum. Mol. Genet., 13, 1213-1218 (2004).
[7] W. de Laat and D. Duboule. Topology of mammalian developmental enhancers and their regulatory landscapes. Nature, 502, 499-506 (2013).
[8] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, and B. Ren. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485:376-380 (2012).
[9] E. Lieberman-Aiden et al. Comprehensive Mapping of Long Range Interactions Reveals Folding Principles of the Human Genome. Science, 326(5950):289-293 (2009).
[10] S. S. P. Rao et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159(7):1665-1680 (2014).
[11] P.J. Shaw. Mapping chromatin conformation. F1000 Biol. Rep., 2, doi:10.3410/B2-18 (2010).
[12] J. Dekker, K. Rippe, M. Dekker, and N. Kleckner. Capturing Chromosome Conformation. Science, 295, 1306 (2002).
[13] Z. Zhao et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat. Genet., 38, 1341 (2006).
[14] M. Simonis et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet., 38, 1348 (2006).
[15] J. Dostie et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res., 16, 1299 (2006).
[16] J. Chiquet, Y. Grandvalet, and C. Ambroise. Inferring multiple graphical structures. Statistics and Computing (2011).
[17] E.M. Marcotte, M. Pellegrini, H.-L. Ng, D.W. Rice, T.O. Yeates, and D. Eisenberg. Detecting protein function and protein-protein interactions from genome sequences. Science, 285, 751-753 (1999).
[18] J. Chen and B. Yuan.
Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics (2006).
[19] J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of Population Structure Using Multilocus Genotype Data. Genetics (2000).
[20] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. PNAS, 99(12):7821-7826 (2002).
[21] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22:888-905 (2000).
[22] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76-80 (2003).
[23] I. Cabreros, E. Abbe, and A. Tsirigos. Detecting Community Structures in Hi-C Genomic Data. arXiv e-prints 1509.05121 (2015).
[24] S. S. P. Rao et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159 (2014).
[25] B. Wang and Z. Tu. Sparse subspace denoising for image manifolds. Computer Vision and Pattern Recognition (CVPR) (2013).
[26] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. TPAMI (2013).
[27] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. NIPS, pages 1601-1608 (2005).
[28] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416 (2007).
[29] V. D. Blondel et al. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008 (2008).
[30] A. Strehl and J. Ghosh. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. JMLR (2003).
[31] ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature (2012).
[32] B. Wang, J. Jiang, W. Wang, Z.-H. Zhou, and Z. Tu. Unsupervised metric fusion by cross diffusion.
Computer Vision and Pattern Recognition (2012).
[33] F. Grubert et al. Genetic control of chromatin states in humans involves local and distal chromosomal interactions. Cell, 162(5):1051-1065 (2015).
[34] J. Ernst and M. Kellis. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods, 9(3):215-216 (2012).
[35] B. Mifsud et al. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nature Genetics, 47(6):598-606 (2015).