{"title": "Supervised Graph Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1433, "page_last": 1440, "abstract": null, "full_text": " Supervised graph inference\n\n\n\n Jean-Philippe Vert Yoshihiro Yamanishi\n Centre de Geostatistique Bioinformatics Center\n Ecole des Mines de Paris Institute for Chemical Research\n 35 rue Saint-Honore Kyoto University\n 77300 Fontainebleau, France Uji, Kyoto 611-0011, Japan\n Jean-Philippe.Vert@mines.org yoshi@kuicr.kyoto-u.ac.jp\n\n\n\n\n Abstract\n\n We formulate the problem of graph inference where part of the graph is\n known as a supervised learning problem, and propose an algorithm to\n solve it. The method involves the learning of a mapping of the vertices\n to a Euclidean space where the graph is easy to infer, and can be formu-\n lated as an optimization problem in a reproducing kernel Hilbert space.\n We report encouraging results on the problem of metabolic network re-\n construction from genomic data.\n\n\n\n1 Introduction\n\nThe problem of graph inference, or graph reconstruction, is to predict the presence or ab-\nsence of edges between a set of points known to form the vertices of a graph, the prediction\nbeing based on observations about the points. This problem has recently drawn a lot of at-\ntention in computational biology, where the reconstruction of various biological networks,\nsuch as gene or molecular networks from genomic data, is a core prerequisite to the re-\ncent field of systems biology that aims at investigating the structures and properties of such\nnetworks. As an example, the in silico reconstruction of protein interaction networks [1],\ngene regulatory networks [2] or metabolic networks [3] from large-scale data generated by\nhigh-throughput technologies, including genome sequencing or microarrays, is one of the\nmain challenges of current systems biology.\n\nVarious approaches have been proposed to solve the network inference problem. 
Bayesian [2] or Petri networks [4] are popular frameworks to model the gene regulatory or the metabolic network, and include methods to infer the network from data such as gene expression or metabolite concentrations [2]. In other cases, such as inferring protein interactions from gene sequences or gene expression, these models are less relevant and more direct approaches involving the prediction of edges between \"similar\" nodes have been tested [5, 6].\n\nThese approaches are unsupervised, in the sense that they base their prediction on prior knowledge about which edges should be present for a given set of points; this prior knowledge might for example be based on a model of conditional independence in the case of Bayesian networks, or on the assumption that edges should connect similar points. The actual situations we are confronted with, however, can often be expressed in a supervised framework: besides the data about the vertices, part of the network is already known. This is obviously the case with all network examples discussed above, and the real challenge is to denoise the observed subgraph, if errors are assumed to be present, and to infer new edges involving in particular nodes outside of the observed subgraph. In order to clarify this point, let us take the example of an actual network inference problem that we treat in the experiment below: the inference of the metabolic network from various genomic data. The metabolic network is a graph of genes that involves only a subset of all the genes of an organism, known as enzymes. Enzymes can catalyze chemical reactions, and an edge between two enzymes indicates that they can catalyze two successive reactions. For most organisms, this graph is partially known, because many enzymes have already been characterized. 
However, many enzymes are also missing, and the problem is to detect uncharacterized enzymes and place them in their correct location in the metabolic network. Mathematically speaking, this means adding new edges involving new points, and possibly modifying edges in the known graph to remove mistakes from our current knowledge.\n\nIn this contribution we propose an algorithm for supervised graph inference, i.e., to infer a graph from observations about the vertices and from the knowledge of part of the graph. Several attempts have already been made to formalize the network inference problem as a supervised machine learning problem [1, 7], but these attempts consist in predicting each edge independently of the others using algorithms for supervised classification. We propose below a radically different setting, where the known subgraph is used to extract a new representation for the vertices, as points in a vector space, where the structure of the graph is easier to infer than from the original observations. The edge inference engine in the vector space is very simple (edges are inferred between nodes with similar representations), and the learning step is limited to the construction of the mapping of the nodes onto the vector space.\n\n\n2 The supervised graph inference problem\n\nLet us formally define the supervised graph inference problem. We suppose an undirected simple graph $G = (V, E)$ is given, where $V = (v_1, \\ldots, v_n) \\in \\mathcal{V}^n$ is a set of vertices and $E \\subset V \\times V$ is a set of edges. The problem is, given an additional set of vertices $V' = (v'_1, \\ldots, v'_m) \\in \\mathcal{V}^m$, to infer a set of edges $E' \\subset V' \\times (V \\cup V') \\cup (V \\cup V') \\times V'$ involving the nodes in $V'$. In many situations of interest, in particular gene networks, the additional nodes $V'$ might be known in advance, but we do not make this assumption here to ensure a level of generality as large as possible. 
For the applications we have in mind, the vertices can be represented in $\\mathcal{V}$ by a variety of data types, including but not limited to biological sequences, molecular structures, expression profiles or metabolite concentrations. In order to allow this diversity and take advantage of recent works on positive definite kernels on general sets [8], we will assume that $\\mathcal{V}$ is a set endowed with a positive definite kernel $k$, that is, a symmetric function $k : \\mathcal{V}^2 \\to \\mathbb{R}$ satisfying $\\sum_{i,j=1}^{p} a_i a_j k(x_i, x_j) \\geq 0$ for any $p \\in \\mathbb{N}$, $(a_1, \\ldots, a_p) \\in \\mathbb{R}^p$ and $(x_1, \\ldots, x_p) \\in \\mathcal{V}^p$.\n\n\n3 From distance learning to graph inference\n\nSuppose first that a graph must be inferred on $p$ points $(x_1, \\ldots, x_p)$ in the Euclidean space $\\mathbb{R}^d$, without further information than that \"similar points\" should be connected. Then the simplest strategy to predict edges between the points is to put an edge between vertices that are at a distance from each other smaller than a fixed threshold. More or fewer edges can be inferred by varying the threshold. We call this strategy the \"direct\" strategy. We now propose to cast the supervised graph inference problem in a two step procedure:\n\n  • map the original points to a Euclidean space through a mapping $f : \\mathcal{V} \\to \\mathbb{R}^d$;\n  • apply the direct strategy to infer the network on the points $\\{f(v), v \\in V \\cup V'\\}$.\n\nWhile the second part of this procedure is fixed, the first part can be optimized by supervised learning of $f$ using the known network. To do so we require the mapping $f$ to map adjacent vertices in the known graph to nearby positions in $\\mathbb{R}^d$, in order to ensure that the known graph can be recovered to some extent by the direct strategy. 
Stated this way, the problem of learning $f$ appears similar to a problem of distance learning that has been raised in the context of clustering [9], an important difference being that we need to define a new representation of the points and therefore a new (Euclidean) distance not only for the points in the training set, but also for points unknown during training.\n\nGiven a function $f : \\mathcal{V} \\to \\mathbb{R}$, a possible criterion to assess whether connected (resp. disconnected) vertices are mapped onto similar (resp. dissimilar) points in $\\mathbb{R}$ is the following:\n\n$$R(f) = \\frac{\\sum_{(u,v) \\in E} (f(u) - f(v))^2 - \\sum_{(u,v) \\notin E} (f(u) - f(v))^2}{\\sum_{(u,v) \\in V^2} (f(u) - f(v))^2}. \\qquad (1)$$\n\nA small value of $R(f)$ ensures that connected vertices tend to be closer than disconnected vertices (in a quadratic error sense). Observe that the denominator ensures an invariance of $R(f)$ with respect to a scaling of $f$ by a constant, which is consistent with the fact that the direct strategy itself is invariant with respect to scaling of the points.\n\nLet us denote by $f_V = (f(v_1), \\ldots, f(v_n))^\\top \\in \\mathbb{R}^n$ the values taken by $f$ on the training set, and by $L$ the combinatorial Laplacian of the graph $G$, i.e., the $n \\times n$ matrix where $L_{i,j}$ is equal to $-1$ (resp. $0$) if $i \\neq j$ and vertices $v_i$ and $v_j$ are connected (resp. disconnected), and $L_{i,i} = -\\sum_{j \\neq i} L_{i,j}$. If we restrict $f_V$ to have zero mean ($\\sum_{v \\in V} f(v) = 0$), then the criterion (1) can be rewritten as follows:\n\n$$R(f) = 4 \\frac{f_V^\\top L f_V}{f_V^\\top f_V} - 2.$$\n\nThe obvious minimum of $R(f)$ under the constraint $\\sum_{v \\in V} f(v) = 0$ is reached for any function $f$ such that $f_V$ is equal to the eigenvector of $L$ with the second smallest eigenvalue (the eigenvector of $L$ with the smallest eigenvalue being the constant vector). However, this only defines the values of $f$ on the points $V$, but leaves indeterminacy on the values of $f$ outside of $V$. 
Moreover, any arbitrary choice of $f$ under a single constraint on $f_V$ is likely to be a mapping that overfits the known graph at the expense of the capacity to infer the unknown edges. To overcome both issues, we propose to regularize the criterion (1) by a smoothness functional on $f$, a classical approach in statistical learning [10, 11]. A convenient setting is to assume that $f$ belongs to the reproducing kernel Hilbert space (r.k.h.s.) $\\mathcal{H}$ defined by the kernel $k$ on $\\mathcal{V}$, and to use the norm of $f$ in the r.k.h.s. as a regularization operator. The regularized criterion to be minimized becomes:\n\n$$\\min_{f \\in \\mathcal{H}_0} \\frac{f_V^\\top L f_V + \\lambda \\|f\\|_{\\mathcal{H}}^2}{f_V^\\top f_V}, \\qquad (2)$$\n\nwhere $\\mathcal{H}_0 = \\{f \\in \\mathcal{H} : \\sum_{v \\in V} f(v) = 0\\}$ is the subset of $\\mathcal{H}$ orthogonal to the function $x \\mapsto \\sum_{v \\in V} k(x, v)$ in $\\mathcal{H}$ and $\\lambda$ is a regularization parameter.\n\nWe note that [12] have recently and independently proposed a similar formulation in the context of clustering. The regularization parameter controls the trade-off between minimizing the original criterion (1) and ensuring that the solution has a small norm in the r.k.h.s. When $\\lambda$ varies, the solution to (2) varies between two extremes:\n\n  • When $\\lambda$ is small, $f_V$ tends to the eigenvector of the Laplacian $L$ with the second smallest eigenvalue. The regularization ensures that $f$ is well defined as a function $\\mathcal{V} \\to \\mathbb{R}$, but $f$ is likely to overfit the known graph.\n\n  • When $\\lambda$ is large, the solution to (2) converges to the first kernel principal component (up to a scaling) [13], whatever the graph. Even though no supervised learning is performed in this case, one can observe that the resulting transformation, when the first $d$ kernel principal components are kept, is similar to the operation performed in spectral clustering [14, 15] where points are mapped onto the first few eigenvectors of a similarity matrix before being clustered.\n\nBefore showing how (2) is solved in practice, we must complete the picture by explaining how the mapping $f : \\mathcal{V} \\to \\mathbb{R}^d$ is obtained. 
First note that the criterion in (2) is defined up to a scaling of the functions, and the solution is therefore a direction in the r.k.h.s. In order to extract a function, an additional constraint must be set, such as imposing the norm $\\|f\\|_{\\mathcal{H}_V} = 1$, or imposing $\\sum_{v \\in V} f(v)^2 = 1$. The first solution corresponds to an orthogonal projection onto the direction selected in the r.k.h.s. (which would for example give the same result as kernel PCA for large $\\lambda$), while the second solution would provide a sphering of the data. We tested both possibilities in practice and found very little difference, with however slightly better results for the first solution (imposing $\\|f\\|_{\\mathcal{H}_V} = 1$). Second, the problem (2) only defines a one-dimensional feature. In order to get a $d$-dimensional representation of the vertices, we propose to iterate the minimization of (2) under orthogonality constraints in the r.k.h.s., that is, we recursively define the $i$-th feature $f_i$ for $i = 1, \\ldots, d$ by:\n\n$$f_i = \\arg\\min_{f \\in \\mathcal{H}_0,\\ f \\perp \\{f_1, \\ldots, f_{i-1}\\}} \\frac{f_V^\\top L f_V + \\lambda \\|f\\|_{\\mathcal{H}}^2}{f_V^\\top f_V}. \\qquad (3)$$\n\n\n4 Implementation\n\nLet $k_V$ be the kernel obtained by centering $k$ on the set $V$, i.e.,\n\n$$k_V(x, y) = k(x, y) - \\frac{1}{n} \\sum_{v \\in V} k(x, v) - \\frac{1}{n} \\sum_{v \\in V} k(y, v) + \\frac{1}{n^2} \\sum_{(v,v') \\in V^2} k(v, v'),$$\n\nand let $\\mathcal{H}_V$ be the r.k.h.s. associated with $k_V$. Then it can easily be checked that $\\mathcal{H}_V = \\mathcal{H}_0$, where $\\mathcal{H}_0$ is defined in the previous section as the subset of $\\mathcal{H}$ of the functions with zero mean on $V$. A simple extension of the representer theorem [10] in the r.k.h.s. $\\mathcal{H}_V$ shows that for any $i = 1, \\ldots, d$, the solution to (3) has an expansion of the form:\n\n$$f_i(x) = \\sum_{j=1}^{n} \\alpha_{i,j} k_V(v_j, x),$$\n\nfor some vector $\\alpha_i = (\\alpha_{i,1}, \\ldots, \\alpha_{i,n})^\\top \\in \\mathbb{R}^n$. The corresponding vector $f_{i,V}$ can be written in terms of $\\alpha_i$ by $f_{i,V} = K_V \\alpha_i$, where $K_V$ is the Gram matrix of the kernel $k_V$ on the set $V$, i.e., $[K_V]_{i,j} = k_V(v_i, v_j)$ for $i, j = 1, \\ldots, n$. 
$K_V$ is obtained from the Gram matrix $K$ of the original kernel $k$ by the classical formula $K_V = (I - U) K (I - U)$, $I$ being the $n \\times n$ identity matrix and $U$ being the constant $n \\times n$ matrix $[U]_{i,j} = 1/n$ for $i, j = 1, \\ldots, n$ [13]. Besides, the norm in $\\mathcal{H}_V$ is equal to $\\|f_i\\|_{\\mathcal{H}_V}^2 = \\alpha_i^\\top K_V \\alpha_i$, and the orthogonality constraint between $f_i$ and $f_j$ in $\\mathcal{H}_V$ translates into $\\alpha_i^\\top K_V \\alpha_j = 0$. As a result, the problem (3) is equivalent to the following:\n\n$$\\alpha_i = \\arg\\min_{\\alpha \\in \\mathbb{R}^n,\\ \\alpha^\\top K_V \\alpha_1 = \\ldots = \\alpha^\\top K_V \\alpha_{i-1} = 0} \\frac{\\alpha^\\top K_V L K_V \\alpha + \\lambda \\alpha^\\top K_V \\alpha}{\\alpha^\\top K_V^2 \\alpha}. \\qquad (4)$$\n\nSetting the differential of (4) with respect to $\\alpha$ to 0, we see that the first vector $\\alpha_1$ must solve the following generalized eigenvector problem with the smallest (non-negative) generalized eigenvalue:\n\n$$(K_V L K_V + \\lambda K_V) \\alpha = \\mu K_V^2 \\alpha. \\qquad (5)$$\n\nThis shows that $\\alpha_1$ must solve the following problem:\n\n$$(L K_V + \\lambda I) \\alpha = \\mu K_V \\alpha, \\qquad (6)$$\n\nup to the addition of a vector $\\epsilon$ satisfying $K_V \\epsilon = 0$. Hence any solution of (5) differs from a solution of (6) by such an $\\epsilon$, which however does not change the corresponding function $f \\in \\mathcal{H}_V$. It is therefore enough to solve (6) in order to find the first vector $\\alpha_1$. $K_V$ being positive semidefinite, the other generalized eigenvectors of (6) are conjugate with respect to $K_V$, so it can easily be checked that the $d$ vectors $\\alpha_1, \\ldots, \\alpha_d$ solving (4) are in fact the $d$ smallest generalized eigenvectors of (6). In practice, for large $n$, the generalized eigenvector problem (6) can be solved by first performing an incomplete Cholesky decomposition of $K_V$, see e.g. 
[16].\n\n[Figure 1, panels: (a) Train vs train; (b) Test vs (Train + test); (c) Test vs test. Horizontal axis: number of features (0 to 100); vertical axis: regularization parameter (log2, -4 to 8).]\n\nFigure 1: ROC score for different numbers of features and regularization parameters, in a 5-fold cross-validation experiment with the integrated kernel (the color scale is adjusted to highlight the variations inside each figure, the performance increases from blue to red).\n\n\n5 Experiment\n\nWe tested the supervised graph inference method described in the previous section on the problem of inferring a gene network of interest in computational biology: the metabolic gene network, with enzymes present in an organism as vertices, and edges between two enzymes when they can catalyze successive chemical reactions [17]. Focusing on the budding yeast S. cerevisiae, the network corresponding to our current knowledge was extracted from the KEGG database [18]. The resulting network contains 769 vertices and 7404 edges. In order to infer it, various independent data about the genes can be used. We focus on three sources of data, likely to contain useful information to infer the graph: a set of 157 gene expression measurements obtained from DNA microarrays [19, 20], the phylogenetic profiles of the genes [21] as vectors of 145 bits indicating the presence or absence of each gene in 145 fully sequenced genomes, and their localization in the cell determined experimentally [22] as vectors of 23 bits indicating the presence of each gene in each of 23 compartments of the cell. In each case a Gaussian RBF kernel was used to represent the data as a kernel matrix. We denote these three datasets as \"exp\", \"phy\" and \"loc\" below. 
Additionally, we considered a fourth kernel obtained by summing the first three kernels. This is a simple approach to data integration that has proved to be useful in [23], for example. This integrated kernel is denoted \"int\" below.\n\nWe performed 5-fold cross-validation experiments as follows. For each random split of the set of genes into 80% (training set) and 20% (test set), the features are learned from the subgraph with genes from the training set as vertices. The edges involving genes in the test set are then predicted among all possible interactions involving the test set. The performance of the inference is estimated in terms of ROC curves (the plot of the percentage of actual edges predicted as a function of the number of edges predicted although they are not present), and in terms of the area under the ROC curve normalized between 0 and 1. Notice that the set of possible interactions to be predicted is made of interactions between two genes in the test set, on the one hand, and between one gene in the test set and one gene in the training set, on the other hand. As it might be more challenging to infer an edge in the former case, we compute two performances: first on the edges involving two nodes in the test set, and second on the edges involving at least one vertex in the test set.\n\nThe algorithm contains 2 free parameters: the number $d$ of features to be kept, and the regularization parameter $\\lambda$ that prevents overfitting the known graph. We varied $\\lambda$ among the values $2^i$, for $i = -5, \\ldots, 8$, and $d$ between 1 and 100. Figure 1 displays the performance in terms of ROC index for the graph inference with the integrated kernel, for different values of $d$ and $\\lambda$. On the training set, it can be seen that increasing $\\lambda$ consistently decreases the performance of the graph reconstruction, which is natural since smaller values of $\\lambda$ are expected to overfit the training graph. 
These results however justify that the criterion (1), although not directly related to the ROC index of the graph reconstruction procedure, is a useful criterion to be optimized. As an example, for very small values of $\\lambda$, the ROC index on the training set is above 96%. The results on the test vs. test and on the test vs. (train + test) experiments show that overfitting indeed occurs for small $\\lambda$ values, and that there is an optimum, both in terms of $d$ and $\\lambda$. The slight difference between the performance landscapes in the experiments \"test vs. test\" and \"test vs. (train + test)\" shows that the first one is indeed more difficult than the latter one, where some form of overfitting is likely to occur (in the mapping of the vertices in the training set). In particular the \"test vs. test\" setting seems to be more sensitive to the number of features selected than the other setting. The absolute values of the ROC scores when 20 features are selected, for varying $\\lambda$, are shown in figure 2. For all kernels tested, overfitting occurs at small $\\lambda$ values, and an optimum exists (around $\\lambda = 2 \\sim 10$). The performance in the setting \"test vs. (train+test)\" is consistently better than that in the setting \"test vs. test\". Finally, and more interestingly, the inference with the integrated kernel outperforms the inference with each individual kernel. This is further highlighted in figure 3, where the ROC curves obtained for 20 features and $\\lambda = 2$ are shown.\n\n\nReferences\n\n [1] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N.J. Krogan, S. Chung, A. Emili, M. Snyder, J.F. Greenblatt, and M. Gerstein. 
A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644), 2003.\n\n[Figure 2, panels: (a) Expression kernel; (b) Localization kernel; (c) Phylogenetic kernel; (d) Integrated kernel. Horizontal axis: regularization parameter (log2, -4 to 8); vertical axis: ROC index (0.5 to 1).]\n\nFigure 2: ROC scores for different regularization parameters when 20 features are selected. Different pictures represent different kernels. In each picture, the dashed blue line, dash-dot red line and continuous black line correspond respectively to the ROC index on the training vs training set, the test vs (training + test) set, and the test vs test set.\n\n[Figure 3, panels: (a) Test vs. (train+test); (b) Test vs. test. Horizontal axis: false positives (%); vertical axis: true positives (%); curves: Kexp, Kloc, Kphy, Kint, Krand.]\n\nFigure 3: ROC with 20 features selected and $\\lambda = 2$ for the various kernels.\n\n [2] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601–620, 2000.\n\n [3] M. Kanehisa. Prediction of higher order functional networks from genomic data. Pharmacogenomics, 2(4):373–385, 2001.\n\n [4] A. Doi, H. Matsuno, M. Nagasaki, and S. Miyano. Hybrid Petri net representation of gene regulatory network. In Proceedings of PSB 5, pages 341–352, 2000.\n\n [5] E.M. Marcotte, M. Pellegrini, H.-L. Ng, D.W. Rice, T.O. Yeates, and D. Eisenberg. 
Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–753, 1999.\n\n [6] F. Pazos and A. Valencia. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Engineering, 9(14):609–614, 2001.\n\n [7] J. R. Bock and D. A. Gough. Predicting protein-protein interactions from primary structure. Bioinformatics, 17:455–460, 2001.\n\n [8] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel methods in computational biology. MIT Press, 2004.\n\n [9] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS 15, pages 505–512. MIT Press, 2003.\n\n[10] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.\n\n[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.\n\n[12] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from examples. Technical Report TR-2004-06, University of Chicago, 2004.\n\n[13] B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 327–352. MIT Press, 1999.\n\n[14] Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proceedings of the IEEE International Conference on Computer Vision, pages 975–982. IEEE Computer Society, 1999.\n\n[15] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS 14, pages 849–856, MIT Press, 2002.\n\n[16] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.\n\n[17] J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA. In NIPS 15. 
MIT Press, 2003.\n\n[18] M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya. The KEGG databases at GenomeNet. Nucleic Acids Research, 30:42–46, 2002.\n\n[19] P. T. Spellman, G. Sherlock, M. Q. Zhang, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9:3273–3297, 1998.\n\n[20] M. Eisen, P. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863–14868, 1998.\n\n[21] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. PNAS, 96(8):4285–4288, 1999.\n\n[22] W.K. Huh, J.V. Falco, C. Gerke, A.S. Carroll, R.W. Howson, J.S. Weissman, and E.K. O'Shea. Global analysis of protein localization in budding yeast. Nature, 425:686–691, 2003.\n\n[23] Y. Yamanishi, J.-P. Vert, A. Nakaya, and M. Kanehisa. Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics, 19:i323–i330, 2003.\n\n\f\n", "award": [], "sourceid": 2738, "authors": [{"given_name": "Jean-philippe", "family_name": "Vert", "institution": null}, {"given_name": "Yoshihiro", "family_name": "Yamanishi", "institution": null}]}