{"title": "Gaussian Process Models for Link Analysis and Transfer Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1657, "page_last": 1664, "abstract": "In this paper we develop a Gaussian process (GP) framework to model a collection of reciprocal random variables defined on the \\emph{edges} of a network. We show how to construct GP priors, i.e.,~covariance functions, on the edges of directed, undirected, and bipartite graphs. The model suggests an intimate connection between \\emph{link prediction} and \\emph{transfer learning}, which were traditionally considered two separate research topics. Though a straightforward GP inference has a very high complexity, we develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity.", "full_text": "Gaussian Process Models for\n\nLink Analysis and Transfer Learning\n\nKai Yu\n\nNEC Laboratories America\n\nCupertino, CA 95014\n\nWei Chu\n\nColumbia University, CCLS\n\nNew York, NY 10115\n\nAbstract\n\nThis paper aims to model relational data on edges of networks. We describe appro-\npriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks.\nThe inter-dependencies of edges can be effectively modeled by adapting the GP\nhyper-parameters. The framework suggests an intimate connection between link\nprediction and transfer learning, which were traditionally two separate research\ntopics. We develop an ef\ufb01cient learning algorithm that can handle a large number\nof observations. 
The experimental results on several real-world data sets verify the superior learning capacity.

1 Introduction

In many scenarios the data of interest consist of relational observations on the edges of networks. Typically, a given finite collection of such relational data can be represented as an M × N matrix Y = {yi,j}, which is often partially observed because many elements are missing. Sometimes Y is accompanied by attributes of nodes or edges. An important characteristic of networks is that the {yi,j} are highly inter-dependent, even conditioned on known node or edge attributes. This phenomenon is extremely common in real-world data, for example,

• Bipartite Graphs. The data represent relations between two different sets of objects, or measurements under a pair of heterogeneous conditions. One notable example is transfer learning, also known as multi-task learning, which jointly learns multiple related but different predictive functions based on the M × N observed labels Y, namely, the results of N functions acting on a set of M data examples. Collaborative filtering is an important application of transfer learning that learns many users' interests on a large set of items.

• Undirected and Directed Graphs. The data are measurements of the existence, strength, and type of links between a set of nodes in a graph, where a given collection of observations is an M × M (in this case N = M) matrix Y, which can be symmetric or asymmetric, depending on whether the links are undirected or directed. Examples include protein-protein interactions, social networks, citation networks, and hyperlinks on the Web. 
Link prediction aims to recover the missing measurements in Y, for example, predicting unknown protein-protein interactions based on known interactions.

The goal of this paper is to design a Gaussian process (GP) [13] framework to model the dependence structure of networks, and to contribute an efficient algorithm to learn and predict large-scale relational data. We explicitly construct a series of parametric models indexed by their dimensionality, and show that in the limit we obtain nonparametric GP priors consistent with the dependence of edge-wise measurements. Since the kernel matrix is defined on a quadratic number of edges, and the computational cost is cubic in the size of that kernel matrix, we develop an efficient algorithm to reduce the computational complexity. We also demonstrate that transfer learning has an intimate connection to link prediction. Our method generalizes several recent transfer learning algorithms by additionally learning a task-specific kernel that directly expresses the dependence between tasks.

The application of GPs to learning on networks or graphs has been fairly recent. Most of the work in this direction has focused on GPs over nodes of graphs, targeted at the classification of nodes [20, 6, 10]. In this paper, we regard the edges as first-class citizens and develop a general GP framework for modeling the dependence of edge-wise observations on bipartite, undirected, and directed graphs. This work extends [19], which built GPs for only bipartite graphs and proposed an algorithm that scales cubically in the number of nodes. In contrast, the work here is more general, and the algorithm scales linearly in the number of edges. 
Our study provides a careful treatment of the nature of edge-wise observations and offers a promising tool for link prediction.

2 Gaussian Processes for Network Data

2.1 Modeling Bipartite Graphs

We first review the edge-wise GP for bipartite graphs [19], where each observation is a measurement on a pair of objects of different types, or under a pair of heterogeneous conditions. Formally, let U and V be two index sets; then yi,j denotes a measurement on edge (i, j) with i ∈ U and j ∈ V. In the context of transfer learning, the pair involves a data instance i and a task j, and yi,j denotes the label of data i within task j. The probabilistic model assumes that the yi,j are noisy outcomes of a real-valued function f : U × V → R, which follows a Gaussian process GP(b, K), characterized by a mean function b and a covariance (kernel) function between edges

K((i, j), (i′, j′)) = Σ(i, i′)Ω(j, j′),    (1)

where Σ and Ω are kernel functions on U and V, respectively. As a result, the realizations of f on a finite set i = 1, . . . , M and j = 1, . . . , N form a matrix F, following a matrix-variate normal distribution N_{M×N}(B, Σ, Ω), or equivalently a normal distribution N(b, K) with mean b = vec(B) and covariance K = Ω ⊗ Σ, where ⊗ denotes the Kronecker product. The dependence structure of edges is thus decomposed into the dependence of nodes. Since a kernel is a notion of similarity, the model expresses a prior belief: if node i is similar to node i′ and node j is similar to node j′, then so are f(i, j) and f(i′, j′).

It is essential to learn the kernels Σ and Ω from the partially observed Y, in order to capture the dependence structure of the network. For transfer learning, this means learning the kernel Σ between data instances and the kernel Ω between tasks. 
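As a minimal numerical sketch of Eq. (1) (our own illustration, not code from the paper; the linear node kernels and random features are placeholder assumptions), the edge-wise covariance of a bipartite graph can be assembled as the Kronecker product K = Ω ⊗ Σ of two node-wise kernel matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Node-wise kernels: Sigma (M x M) over instances, Omega (N x N) over tasks.
# Here both are linear kernels X X^T built from random placeholder features.
M, N = 4, 3
X = rng.standard_normal((M, 2))   # instance features (assumed for illustration)
Z = rng.standard_normal((N, 2))   # task features (assumed for illustration)
Sigma = X @ X.T
Omega = Z @ Z.T

# Edge-wise covariance of Eq. (1): K((i,j),(i',j')) = Sigma(i,i') * Omega(j,j').
# Stacking f as vec(F) gives K = kron(Omega, Sigma).
K = np.kron(Omega, Sigma)

# Check one entry against the defining formula.
i, j, ip, jp = 1, 2, 3, 0
row, col = j * M + i, jp * M + ip   # vec() ordering: index (i, j) -> j*M + i
assert np.isclose(K[row, col], Sigma[i, ip] * Omega[j, jp])
print(K.shape)  # (12, 12)
```

The sketch only verifies the decomposition of edge-wise dependence into node-wise kernels; learning Σ and Ω from partial observations is the subject of Sec. 3.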
Having Σ and Ω, it is then possible to predict the missing yi,j from the known observations by GP inference.

Theorem 2.1 ([19]). Let f(i, j) = D^{-1/2} ∑_{k=1}^{D} g_k(i) h_k(j) + b(i, j), where g_k ~iid GP(0, Σ) and h_k ~iid GP(0, Ω). Then f ∼ GP(b, K) in the limit D → ∞, and the covariance between pairs is K((i, j), (i′, j′)) = Σ(i, i′)Ω(j, j′).

Theorem (2.1) offers an alternative view of the model. The edge-wise function f can be decomposed into a product of two sets of intermediate node-wise functions, {g_k} and {h_k}, which are i.i.d. samples from the two GP priors GP(0, Σ) and GP(0, Ω). The theorem shows that the GP model for bipartite relational data is a generalization of Bayesian low-rank matrix factorization F = HG^T + B under the priors H ∼ N_{M×D}(0, Σ, I) and G ∼ N_{N×D}(0, Ω, I). When D is finite, the elements of F are not Gaussian random variables.

2.2 Modeling Directed and Undirected Graphs

In this section we model observations on pairs of nodes from the same set U. This case includes both directed and undirected graphs. It turns out that directed graphs are relatively easy to handle, while deriving a GP prior for undirected graphs is slightly non-trivial. For directed graphs, we let the function f : U × U → R follow GP(b, K), where the covariance function between edges is

K((i, j), (i′, j′)) = C(i, i′)C(j, j′),    (2)

and C : U × U → R is a kernel function between nodes. Since a random function f drawn from this GP is generally asymmetric (even if b is symmetric), namely f(i, j) ≠ f(j, i), the direction of edges can be modeled. The covariance function Eq. (2) can be derived from Theorem (2.1) by letting {g_k} and {h_k} be two independent sets of functions, i.i.d. sampled from the same GP prior GP(0, C). This models the situation that each node's behavior as a sender is different from, but statistically related to, its behavior as a receiver, which is a reasonable assumption: for example, if two papers cite a common set of papers, they are also likely to be cited by a common set of other papers.

For undirected graphs, we need to design a GP that ensures every sampled function is symmetric. Following the construction in Theorem (2.1), it seems that f is symmetric if g_k ≡ h_k for k = 1, . . . , D. However, a short calculation reveals that f is then not bounded in the limit D → ∞. Theorem (2.2) shows that the problem can be solved by subtracting the growing quantity D^{1/2} C(i, j), and suggests the covariance function

K((i, j), (i′, j′)) = C(i, i′)C(j, j′) + C(i, j′)C(j, i′).    (3)

With this covariance function, f is ensured to be symmetric because the covariance between f(i, j) and f(j, i) equals the variance of either.

Theorem 2.2. Let f(i, j) = D^{-1/2} ∑_{k=1}^{D} t_k(i) t_k(j) + b(i, j) - D^{1/2} C(i, j), where t_k ~iid GP(0, C). Then f ∼ GP(b, K) in the limit D → ∞, and the covariance between pairs is K((i, j), (i′, j′)) = C(i, i′)C(j, j′) + C(i, j′)C(j, i′). If b(i, j) = b(j, i), then f(i, j) = f(j, i).

Proof. Without loss of generality, let b(i, j) ≡ 0. By the central limit theorem, for every (i, j), f(i, j) converges to a zero-mean Gaussian random variable as D → ∞, because {t_k(i) t_k(j)}_{k=1}^{D} is a collection of independent random variables following the same distribution, with mean C(i, j). The covariance function is

Cov(f(i, j), f(i′, j′)) = (1/D) ∑_{k=1}^{D} { E[t_k(i) t_k(j) t_k(i′) t_k(j′)] - C(i, j) E[t_k(i′) t_k(j′)] - C(i′, j′) E[t_k(i) t_k(j)] + C(i, j) C(i′, j′) }
= C(i, i′)C(j, j′) + C(i, j′)C(j, i′) + C(i, j)C(i′, j′) - C(i, j)C(i′, j′)
= C(i, i′)C(j, j′) + C(i, j′)C(j, i′).

Interestingly, Theorem (2.2) recovers Theorem (2.1) and is thus more general. To see the connection, let h_k ∼ GP(0, Σ) and g_k ∼ GP(0, Ω) be concatenated into a single function t_k; then t_k ∼ GP(0, C) with covariance

C(i, j) = Σ(i, j) if i, j ∈ U;  Ω(i, j) if i, j ∈ V;  0 if i and j are in different sets.    (4)

For i, i′ ∈ U and j, j′ ∈ V, applying Theorem (2.2) leads to

f(i, j) = D^{-1/2} ∑_{k=1}^{D} t_k(i) t_k(j) + b(i, j) - D^{1/2} C(i, j) = D^{-1/2} ∑_{k=1}^{D} h_k(i) g_k(j) + b(i, j),    (5)

K((i, j), (i′, j′)) = C(i, i′)C(j, j′) + C(i, j′)C(j, i′) = Σ(i, i′)Ω(j, j′).    (6)

Theorems (2.1) and (2.2) suggest a general GP framework for modeling directed or undirected relationships connecting heterogeneous types of nodes. Basically, we learn node-wise covariance functions, such as Σ, Ω, and C, so that the edge-wise covariances composed by Eq. (1), (2), or (3) explain the observations yi,j on the edges. The proposed framework can be extended to cope with more complex network data, for example, networks containing both undirected and directed links. We briefly discuss some extensions in Sec. 6.

3 An Efficient Learning Algorithm

We consider the regression case under a Gaussian noise model, and later briefly discuss extensions to the classification case. 
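The finite-D construction of Theorem 2.2 can be checked numerically (a sketch under the assumption of a linear node kernel C(i, j) = ⟨xi, xj⟩, with each t_k sampled as a linear function t_k(i) = ⟨w_k, xi⟩, w_k ∼ N(0, I), so that E[t_k(i) t_k(j)] = C(i, j); the feature sizes are placeholder values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear node kernel C = X X^T over 4 nodes (assumed for illustration).
X = rng.standard_normal((4, 3))
C = X @ X.T

# Finite-D sample of Theorem 2.2: t_k(i) = <w_k, x_i> gives t ~ GP(0, C).
D = 500
W = rng.standard_normal((3, D))
T = X @ W                                      # T[i, k] = t_k(i)
F = (T @ T.T) / np.sqrt(D) - np.sqrt(D) * C    # f(i, j), with b = 0

# f is exactly symmetric for every finite D, since t_k(i)t_k(j) = t_k(j)t_k(i).
assert np.allclose(F, F.T)

# Isserlis' theorem for zero-mean Gaussians gives E[t(i)t(j)t(i')t(j')] =
#   C(i,j)C(i',j') + C(i,i')C(j,j') + C(i,j')C(j,i'),
# so Cov(f(i,j), f(i',j')) = C(i,i')C(j,j') + C(i,j')C(j,i')  -- Eq. (3).
i, j, ip, jp = 0, 1, 2, 3
fourth = C[i, j]*C[ip, jp] + C[i, ip]*C[j, jp] + C[i, jp]*C[j, ip]
cov = fourth - C[i, j]*C[ip, jp]   # subtract the product of the means
assert np.isclose(cov, C[i, ip]*C[j, jp] + C[i, jp]*C[j, ip])
```

The second check simply walks through the fourth-moment step of the proof for one index tuple; the subtraction of D^{1/2} C(i, j) is what keeps F finite as D grows.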
Let y = [yi,j]_{(i,j)∈O} be the observation vector of length |O|, let f be the corresponding vector of values of the latent function f, and let K be the |O| × |O| matrix of K between the observed edges, computed by Eq. (1)-(3). The observations on edges are then generated by

yi,j = f(i, j) + bi,j + εi,j,    (7)

where f ∼ N(0, K), εi,j ~iid N(0, β^{-1}), and the mean has the parametric form bi,j = μi + νj. In the directed/undirected graph case we let μi = νi for every i ∈ U. The latent f can be marginalized out analytically; the marginal distribution of the observations is then

p(y|θ) = N(y; b, K + β^{-1} I),    (8)

where θ = {β, b, K}. The parameters can be estimated by minimizing the penalized negative log-likelihood L(θ) = −ln p(y|θ) + ℓ(θ) under a suitable regularization ℓ(θ). The objective function has the form

L(θ) = (|O|/2) log 2π + (1/2) ln|C| + (1/2) tr[C^{-1} m m^T] + ℓ(θ),    (9)

where C = K + β^{-1} I, m = y − b, and b = [bi,j], (i, j) ∈ O. The regularizer ℓ(θ) will be configured in Sec. 3.1. Gradient-based optimization packages can be applied to find a local optimum of θ. However, the computation becomes prohibitive when the number |O| of measured edges is large, because the memory cost is O(|O|^2) and the computational cost is O(|O|^3); in our experiments |O| is in the tens of thousands or even millions. A slightly improved algorithm was introduced in [19], with a complexity O(M^3 + N^3), cubic in the number of nodes. That algorithm employed a non-Gaussian approximation based on Theorem (2.1) and is applicable only to bipartite graphs.

We reduce the memory and computational cost by exploiting the special structure of K as discussed in Sec. 
2, assuming K to be composed of node-wise linear kernels Σ(i, i′) = ⟨xi, xi′⟩, Ω(j, j′) = ⟨zj, zj′⟩, and C(i, j) = ⟨xi, xj⟩, with x ∈ R^{L1} and z ∈ R^{L2}. The edge-wise covariance is then

• Bipartite graphs: K((i, j), (i′, j′)) = ⟨xi ⊗ zj, xi′ ⊗ zj′⟩.
• Directed graphs: K((i, j), (i′, j′)) = ⟨xi ⊗ xj, xi′ ⊗ xj′⟩.
• Undirected graphs: K((i, j), (i′, j′)) = ⟨xi ⊗ xj, xi′ ⊗ xj′⟩ + ⟨xi ⊗ xj, xj′ ⊗ xi′⟩.

We thus turn the problem of optimizing K into the problem of optimizing X = [x1, . . . , xM]^T and Z = [z1, . . . , zN]^T. It is important to note that in all cases the kernel matrix has the form K = U U^T, where U is an |O| × L matrix with L ≪ |O|; therefore applying the Woodbury identity C^{-1} = β[I − U(U^T U + β^{-1} I)^{-1} U^T] dramatically reduces the computational cost. For the bipartite graph case and the directed graph case, respectively,

U^T = [xi ⊗ zj]_{(i,j)∈O}  and  U^T = [xi ⊗ xj]_{(i,j)∈O},    (10)

where the rows of U are indexed by (i, j) ∈ O. 
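The saving from the Woodbury identity can be illustrated directly (our own sketch; β, the number of observed edges, and the random edge features U are placeholder values): C = UUᵀ + β⁻¹I is inverted through the small L × L matrix UᵀU + β⁻¹I instead of the full |O| × |O| matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# |O| observed edges, low-rank edge features U (|O| x L), noise precision beta.
n_obs, L, beta = 200, 5, 4.0
U = rng.standard_normal((n_obs, L))   # placeholder edge features x_i (x) z_j

C = U @ U.T + np.eye(n_obs) / beta    # C = K + beta^{-1} I with K = U U^T

# Direct O(|O|^3) inverse vs Woodbury O(L^3 + |O| L^2) inverse:
C_inv_direct = np.linalg.inv(C)
small = np.linalg.inv(U.T @ U + np.eye(L) / beta)   # only an L x L system
C_inv_woodbury = beta * (np.eye(n_obs) - U @ small @ U.T)

assert np.allclose(C_inv_direct, C_inv_woodbury)
```

The same trick applies to the log-determinant in Eq. (9) via the matrix determinant lemma, which is what makes |O| in the millions tractable for moderate L.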
For the undirected graph case, we first rewrite the kernel function

K((i, j), (i′, j′)) = ⟨xi ⊗ xj, xi′ ⊗ xj′⟩ + ⟨xi ⊗ xj, xj′ ⊗ xi′⟩
= (1/2)[⟨xi ⊗ xj, xi′ ⊗ xj′⟩ + ⟨xj ⊗ xi, xj′ ⊗ xi′⟩ + ⟨xi ⊗ xj, xj′ ⊗ xi′⟩ + ⟨xj ⊗ xi, xi′ ⊗ xj′⟩]
= (1/2)⟨xi ⊗ xj + xj ⊗ xi, xi′ ⊗ xj′ + xj′ ⊗ xi′⟩,    (11)

and then obtain a simple form for the undirected graph case:

U^T = [ (1/√2)(xi ⊗ xj + xj ⊗ xi) ]_{(i,j)∈O}.    (12)

The overall computational cost is O(L^3 + |O| L^2). Empirically we found the algorithm efficient enough to handle L = 500 when |O| is in the millions. The gradients with respect to U can be found in [12]; further gradients with respect to X and Z are easily derived, and we omit the details to save space. Finally, in order to predict the missing measurements, we only need to estimate a simple linear model f(i, j) = w^T ui,j + bi,j.

3.1 Incorporating Additional Attributes and Learning from Discrete Observations

There are different ways to incorporate node or edge attributes into our model. A common practice is to let the kernel K, Σ, or Ω be some parametric function of the attributes, for instance an RBF function. However, node or edge attributes are typically local information, while the network itself is a global dependence structure; network data therefore often contain a large part of patterns that are independent of those known predictors. 
In the following, using the example of placing a Bayesian prior on Σ : U × U → R, we describe a flexible way to incorporate additional knowledge. Let Σ0 be the covariance that we wish Σ to be a priori close to. We apply the prior p(Σ) = (1/Z) exp(−τ E(Σ)) and use its negative log-likelihood as a regularization for Σ:

ℓ(Σ) = τ E(Σ) = (τ/2) [ log|Σ + γ^{-1} I| + tr( (Σ + γ^{-1} I)^{-1} Σ0 ) ],    (13)

where τ is a hyperparameter predetermined on validation data, and γ^{-1} is a small number to be optimized. The energy function E(Σ) is related to the KL divergence D_KL(GP(0, Σ0) || GP(0, Σ + γ^{-1} δ)), where δ(·,·) is the Dirac kernel. If we let Σ0 be the linear kernel of the attributes, normalized by the dimensionality, then E(Σ) can be derived from a likelihood of Σ, as if each dimension of the attributes were a random sample from GP(0, Σ + γ^{-1} δ). If the attributes are nonlinear predictors, we can conveniently set Σ0 to a nonlinear kernel. We set Σ0 = I if the corresponding attributes are absent. ℓ(Ω), ℓ(C), and ℓ(K) can be set in the same way.

The observations can be discrete variables rather than real values. In this case, an appropriate likelihood function can be devised accordingly. For example, the probit function can be employed as the likelihood for binary labels, relating f(i, j) to the target yi,j ∈ {−1, +1} by a cumulative normal Φ(yi,j (f(i, j) + bi,j)). To preserve computational tractability, a family of inference techniques, e.g., the Laplace approximation, can be applied to find a Gaussian distribution that approximates the true likelihood. 
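A small sketch of how the regularizer in Eq. (13) could be evaluated (our own illustration; τ, γ, and the random placeholder kernels Σ and Σ0 are assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def reg_loss(Sigma, Sigma0, tau, gamma):
    """tau/2 * [ log|Sigma + (1/gamma) I| + tr((Sigma + (1/gamma) I)^{-1} Sigma0) ]."""
    M = Sigma.shape[0]
    A = Sigma + np.eye(M) / gamma
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0                      # A must be positive definite
    return 0.5 * tau * (logdet + np.trace(np.linalg.solve(A, Sigma0)))

# Placeholder kernels: Sigma from current features, Sigma0 = I (attributes absent).
F = rng.standard_normal((6, 3))
Sigma = F @ F.T
Sigma0 = np.eye(6)
loss = reg_loss(Sigma, Sigma0, tau=1.0, gamma=100.0)

# Sanity check: when Sigma + (1/gamma) I = I, logdet is 0 and the trace term is M.
ident = reg_loss(np.eye(6) - np.eye(6) / 100.0, np.eye(6), tau=1.0, gamma=100.0)
assert np.isclose(ident, 0.5 * 6.0)
```

Using `slogdet` and `solve` rather than explicit determinants and inverses keeps the evaluation numerically stable, matching how such a term would typically enter a gradient-based optimizer.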
Then, the marginal likelihood (8) can be written as an explicit expression and the gradient can be derived analytically as well.

4 Discussions on Related Work

Transfer Learning: As suggested above, link prediction for bipartite graphs has a tight connection to transfer learning. To make this clear, let fj(·) = f(·, j); then the edge-wise function f : U × V → R consists of N node-wise functions fj : U → R, for j = 1, . . . , N. If we fix Ω(j, j′) ≡ δ(j, j′), namely a Dirac delta function, then the fj are assumed to be i.i.d. GP functions drawn from GP(0, Σ), where each function corresponds to one learning task. This is the hierarchical Bayesian model that assumes multiple tasks share the same GP prior [18]. In particular, the negative logarithm of p({yi,j}, {fj} | Σ) is

L({fj}, Σ) = ∑_{j=1}^{N} [ ∑_{i∈Oj} l(yi,j, fj(i)) + (1/2) fj^T Σ^{-1} fj ] + (N/2) log|Σ|,    (14)

where l(yi,j, fj(i)) = −log p(yi,j | fj(i)). This form is close to the recent convex multi-task learning in a regularization framework [3], if the log-determinant term is replaced by a trace regularization term λ tr(Σ). It was proven in [3] that if l(·,·) is convex in fj, then the minimization of (14) is jointly convex in {fj} and Σ. The GP approach differs from the regularization approach in two aspects: (1) the fj are treated as random variables and marginalized out, so we only need to estimate Σ; (2) the regularization for Σ is a non-convex log-determinant term. Interestingly, because log|Σ| ≤ tr(Σ) − M, the trace norm is a convex envelope for the log-determinant, and the two minimization problems therefore behave similarly. 
However, the framework introduced in this paper goes beyond these two methods by introducing an informative kernel Ω between tasks. From a probabilistic modeling point of view, the independence of {fj} conditioned on Σ is a restrictive assumption, and even incorrect when task-specific attributes are given (which means that the {fj} are no longer exchangeable). A task-specific kernel for transfer learning was recently introduced in [4], which however increased the computational complexity by a factor of N^2. One contribution of this paper to transfer learning is an algorithm that can efficiently solve the learning problem with both a data kernel Σ and a task kernel Ω.

Gaussian Process Latent-Variable Model (GPLVM): Our learning algorithm is also a generalization of GPLVM. If we enforce Ω(j, j′) = δ(j, j′) in the model of bipartite graphs, then the evidence Eq. (9) is equivalent to the GPLVM objective

L(Σ, β) = (MN/2) log 2π + (N/2) ln|Σ + β^{-1} I| + (1/2) tr[ (Σ + β^{-1} I)^{-1} Y Y^T ],    (15)

where Y is a fully observed M × N matrix, the mean B = 0, and there is no further regularization on Σ. GPLVM assumes that the columns of Y are conditionally independent given Σ. In this paper we consider a situation with complex dependence among the edges of network graphs.

Other Related Work: Getoor et al. [7] introduced link uncertainty in the framework of probabilistic relational models. Latent-class relational models [17, 11, 1] have been popular, aiming to find the block structure of links. Link prediction was cast as structured-output prediction in [15, 2]. Statistical models based on matrix factorization were studied in [8]. 
Our work is similar to [8] in the sense that relations are modeled by multiplications of node-wise factors. Very recently, Hoff showed in [9] that the multiplicative model generalizes the latent-class models [11, 1] and can encode the transitivity of relations.

Figure 1: Left: the subset of the UMist Faces data containing 10 people at 10 different views; the blank blocks indicate the ten knocked-off images used as test cases. Right: the ten knocked-off images (first row) along with predictive images; the second row shows our results, the third row the MMMF results, and the fourth row the bilinear results.

5 Numerical Experiments

We set the dimensionality of the model via validation on 10% of the training data. In cases where additional attributes on nodes or edges are either unavailable or very weak, we compare our method with max-margin matrix factorization (MMMF) [14] using a square loss, which is similar to singular value decomposition (SVD) but can handle missing measurements.

5.1 A Demonstration on Face Reconstruction

A subset of the UMist Faces images of size 112 × 92 was selected to illustrate our algorithm; it consists of 10 people at 10 different views. We manually knocked 10 images off as test cases, as presented in Figure 1, and treated each image as a vector, which leads to a 103040 × 10 matrix with 103040 missing values, where each column corresponds to a view of the faces. The GP was trained with L1 = L2 = 4 on this matrix to learn the appearance relationships between person identity and pose. The images recovered by the GP for the test cases are presented in the second row of Figure 1, right (RMSE = 0.2881). The results of MMMF are presented in the third row (RMSE = 0.4351). 
We also employed the bilinear models introduced by [16], which however cannot handle missing data in a matrix, and put those results in the bottom row for comparison. Quantitatively and perceptually, our model offers a better generalization to unseen views of known persons.

5.2 Collaborative Filtering

Collaborative filtering is a typical case of bipartite graphs, where ratings are measurements on edges of user-item pairs. We carried out a series of experiments on the whole EachMovie data set, which includes 61265 users' 2811718 distinct numeric ratings on 1623 movies. We randomly selected 80% of each user's ratings for training and used the remaining 20% as test cases. The random selection was carried out 20 times independently.

For comparison purposes, we also evaluated the predictive performance of four other approaches: 1) Movie Mean: the empirical mean of ratings per movie was used as the predicted value for all users' ratings on that movie; 2) User Mean: the empirical mean of ratings per user was used as the predicted value for that user's ratings on all movies; 3) Pearson Score: the Pearson correlation coefficient corresponds to a dot product between normalized rating vectors. We computed the Gram matrices of the Pearson score with mean imputation for movies and users respectively, and took principal components as their individual attributes; we tried 20 or 50 principal components as attributes and carried out least-squares regression on the observed entries. 4) MMMF: the optimal rank was decided by validation.

Table 1: Test results on the EachMovie data. The number in brackets indicates the rank applied. The results are averaged over 20 trials, along with the standard deviation. To evaluate accuracy, we use root mean squared error (RMSE), mean absolute error (MAE), and normalized mean squared error (NMSE), i.e., the RMSE normalized by the standard deviation of the observations.

METHODS        RMSE            MAE             NMSE
MOVIE MEAN     1.3866±0.0013   1.1026±0.0010   0.7844±0.0012
USER MEAN      1.4251±0.0011   1.1405±0.0009   0.8285±0.0008
PEARSON(20)    1.3097±0.0012   1.0325±0.0013   0.6999±0.0011
PEARSON(50)    1.3034±0.0018   1.0277±0.0015   0.6931±0.0019
MMMF(3)        1.2245±0.0503   0.9392±0.0246   0.6127±0.0516
MMMF(15)       1.1696±0.0283   0.8918±0.0146   0.5585±0.0286
GP(3)          1.1557±0.0010   0.8781±0.0009   0.5449±0.0011

Table 2: Test results on the Cora data. The classification accuracy rate is averaged over 5 trials, each with 4 folds for training and one fold for testing.

METHODS     DS            HA            ML            PL
CONTENT     53.70±0.50    67.50±1.70    68.30±1.60    56.40±0.70
LINK        48.90±1.70    65.80±1.40    60.70±1.10    58.20±0.70
PCA(50)     61.61±1.42    69.36±1.36    70.06±0.90    60.26±1.16
GP(50)      62.10±0.84    75.40±0.80    78.30±0.78    63.25±0.60

The results of these approaches are reported in Table 1. The per-movie average yields much better results than the per-user average, which is consistent with the findings previously reported in [5]. The improvement from using more components of the Pearson score is noticeable but not significant. The generalization performance of our algorithm is better than that of the others; a t-test showed a significant difference (p-value 0.0387) of GP over MMMF (with 15 dimensions) in terms of RMSE. It is well worth highlighting another attractive property of our algorithm: the compact representation of the factors. On the EachMovie data, only three factors already represent thousands of items well. We also trained MMMF with 3 factors. 
Although a three-factor solution like the one GP found is also accessible to other models, MMMF failed to achieve comparable performance in this setting (see the results of MMMF(3)). In each trial, the number of training samples is around 2.25 million. Our program took about 865 seconds to complete 500 L-BFGS updates on all 251572 parameters using an AMD Opteron 2.6GHz processor.

5.3 Text Categorization based on Contents and Links

We used a part of the Cora corpus including 751 papers on data structures (DS), 400 papers on hardware and architecture (HA), 1617 on machine learning (ML), and 1575 on programming languages (PL). We treated the citation network as a directed graph and modeled link existence as binary labels. Our model applied the probit likelihood and learned a node-wise covariance function C with 50-dimensional node features (so that the edge-wise features have length L = 50 × 50), which composes an edge-wise covariance K by Eq. (2). We set the prior covariance C0 to the linear kernel computed from bag-of-words content attributes. Thus the learned linear features encode both link and content information, and they were then used for document classification. We compared several other methods that provide linear features for one-against-all categorization using an SVM: 1) CONTENT: bag-of-words features; 2) LINK: each paper's citation list; 3) PCA: 50 components from PCA on the concatenation of the bag-of-words features and the citation list of each paper. We chose dimensionality 50 for both GP and PCA because the performance of both saturated when the dimensionality exceeded 50. We report results based on 5-fold cross-validation in Table 2. GP clearly outperformed the other methods in 3 out of 4 categories. The main reason, we believe, is that our approach models the in-bound and out-bound behaviors of each paper simultaneously.

6 Conclusion and Extensions

In this paper we proposed GPs for modeling data living on the links of networks. 
We described solutions to handle directed and undirected links, as well as links connecting heterogeneous nodes. This work paves the way for future extensions to more complex relational data. For example, we can model a network containing both directed and undirected links. Let (i, j) be directed and (i′, j′) be undirected. Based on the feature representations, Eq. (10)-right for directed links and Eq. (12) for undirected links, the covariance is K((i, j), (i′, j′)) = (1/√2)[C(i, i′)C(j, j′) + C(i, j′)C(j, i′)], which indicates that dependence between a directed link and an undirected link is penalized compared to dependence between two undirected links. Moreover, GPs can be employed to model multiple networks involving multiple different types of nodes. For each type, we use one node-wise covariance. Letting the covariance between two different types of nodes be zero, we obtain a large block-diagonal node-wise covariance matrix, where each block corresponds to one type of node. This big covariance matrix then induces the edge-wise covariance for links connecting nodes of the same or different types. In the near future it will be promising to apply the model to various link prediction and network completion problems.

References

[1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic block models for relational data with application to protein-protein interactions. Biometrics Society Annual Meeting, 2006.

[2] S. Andrews and T. Jebara. Structured network learning. NIPS Workshop on Learning to Compare Examples, 2006.

[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 2007.

[4] E. V. Bonilla, F. V. Agakov, and C. K. I. Williams. Kernel multi-task learning using task-specific features. International Conference on Artificial Intelligence and Statistics, 2007.

[5] J. 
Canny. Collaborative filtering with privacy via factor analysis. International ACM SIGIR Conference, 2002.

[6] W. Chu, V. Sindhwani, Z. Ghahramani, and S. S. Keerthi. Relational learning with Gaussian processes. Neural Information Processing Systems 19, 2007.

[7] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text and link structure for hypertext classification. IJCAI Workshop, 2001.

[8] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. To appear in Computational and Mathematical Organization Theory, 2007.

[9] P. Hoff. Modeling homophily and stochastic equivalence in symmetric relational data. To appear in Neural Information Processing Systems 20, 2007.

[10] A. Kapoor, Y. Qi, H. Ahn, and R. W. Picard. Hyperparameter and kernel learning for graph based semi-supervised classification. Neural Information Processing Systems 18, 2006.

[11] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. AAAI Conference on Artificial Intelligence, 2006.

[12] N. Lawrence. Gaussian process latent variable models. Journal of Machine Learning Research, 2005.

[13] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[14] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. International Conference on Machine Learning, 2005.

[15] B. Taskar, M. F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. Neural Information Processing Systems 16, 2004.

[16] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 2000.

[17] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. International Conference on Uncertainty in Artificial Intelligence, 2006.

[18] K. 
Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. International Conference on Machine Learning, 2005.

[19] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. Neural Information Processing Systems 19, 2007.

[20] X. Zhu, J. Lafferty, and Z. Ghahramani. Semi-supervised learning: From Gaussian fields to Gaussian processes. Technical Report CMU-CS-03-175, Carnegie Mellon University, 2003.", "award": [], "sourceid": 928, "authors": [{"given_name": "Kai", "family_name": "Yu", "institution": null}, {"given_name": "Wei", "family_name": "Chu", "institution": null}]}