{"title": "Nonparametric Latent Feature Models for Link Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1276, "page_last": 1284, "abstract": "As the availability and importance of relational data -- such as the friendships summarized on a social networking website -- increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting nonparametric Bayesian methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable -- latent features -- using a nonparametric Bayesian technique to simultaneously infer the number of features at the same time we learn which entities have each feature. The greater expressiveness of this approach allows us to improve link prediction on three datasets.", "full_text": "Nonparametric Latent Feature Models\n\nfor Link Prediction\n\nKurt T. Miller\n\nEECS\n\nUniversity of California\n\nBerkeley, CA 94720\n\nThomas L. Grif\ufb01ths\n\nPsychology and Cognitive Science\n\nUniversity of California\n\nBerkeley, CA 94720\n\ntadayuki@cs.berkeley.edu\n\ntom griffiths@berkeley.edu\n\nMichael I. Jordan\nEECS and Statistics\n\nUniversity of California\n\nBerkeley, CA 94720\n\njordan@cs.berkeley.edu\n\nAbstract\n\nAs the availability and importance of relational data\u2014such as the friendships sum-\nmarized on a social networking website\u2014increases, it becomes increasingly im-\nportant to have good models for such data. The kinds of latent structure that have\nbeen considered for use in predicting links in such networks have been relatively\nlimited. 
In particular, the machine learning community has focused on latent class\nmodels, adapting Bayesian nonparametric methods to jointly infer how many la-\ntent classes there are while learning which entities belong to each class. We pursue\na similar approach with a richer kind of latent variable\u2014latent features\u2014using a\nBayesian nonparametric approach to simultaneously infer the number of features\nat the same time we learn which entities have each feature. Our model combines\nthese inferred features with known covariates in order to perform link prediction.\nWe demonstrate that the greater expressiveness of this approach allows us to im-\nprove performance on three datasets.\n\n1 Introduction\n\nStatistical analysis of social networks and other relational data has been an active area of research for\nover seventy years and is becoming an increasingly important problem as the scope and availability\nof social network datasets increase [1]. In these problems, we observe the interactions between a set\nof entities and we wish to extract informative representations that are useful for making predictions\nabout the entities and their relationships. One basic challenge is link prediction, where we observe\nthe relationships (or \u201clinks\u201d) between some pairs of entities in a network (or \u201cgraph\u201d) and we try\nto predict unobserved links. For example, in a social network, we might only know some subset of\npeople are friends and some are not, and seek to predict which other people are likely to get along.\nOur goal is to improve the expressiveness and performance of generative models based on extracting\nlatent structure representing the properties of individual entities from the observed data, so we will\nfocus on these kinds of models. This rules out approaches like the popular p\u2217 model that uses global\nquantities of the graph, such as how many edges or triangles are present [2, 3]. 
Approaches\nthat do link prediction based on attributes of the individual entities can largely be classified\ninto class-based and feature-based approaches. There are many models that can be placed under\nthese approaches, so we will focus on the models that are most comparable to our approach.\n\nMost generative models using a class-based representation are based on the stochastic blockmodel,\nintroduced in [4] and further developed in [5]. In the most basic form of the model, we assume there\nis a finite number of classes that entities can belong to and that these classes entirely determine the\nstructure of the graph, with the probability of a link existing between two entities depending only\non the classes of those entities. In general, these classes are unobserved, and inference reduces to\nassigning entities to classes and inferring the class interactions. One of the important issues that arise\nin working with this model is determining how many latent classes there are for a given problem.\nThe Infinite Relational Model (IRM) [6] used methods from nonparametric Bayesian statistics to\ntackle this problem, allowing the number of classes to be determined at inference time. The Infinite\nHidden Relational Model [7] further elaborated on this model and the Mixed Membership Stochastic\nBlockmodel (MMSB) [8] extended it to allow entities to have mixed memberships.\nAll these class-based models share a basic limitation in the kinds of relational structure they naturally\ncapture. For example, in a social network, we might find a class which contains \u201cmale high\nschool athletes\u201d and another which contains \u201cmale high school musicians.\u201d We might believe these\ntwo classes will behave similarly, but with a class-based model, our options are to either merge the\nclasses or duplicate our knowledge about common aspects of them. 
In a similar vein, with a limited\namount of data, it might be reasonable to combine these into a single class \u201cmale high school stu-\ndents,\u201d but with more data we would want to split this group into athletes and musicians. For every\nnew attribute like this that we add, the number of classes would potentially double, quickly leading\nto an overabundance of classes. In addition, if someone is both an athlete and a musician, we would\neither have to add another class for that or use a mixed membership model, which would say that\nthe more a student is an athlete, the less he is a musician.\nAn alternative approach that addresses this problem is to use features to describe the entities. There\ncould be a separate feature for \u201chigh school student,\u201d \u201cmale,\u201d \u201cathlete,\u201d and \u201cmusician\u201d and the\npresence or absence of each of these features is what de\ufb01nes each person and determines their\nrelationships. One class of latent-feature models for social networks has been developed by [9, 10,\n11], who proposed real-valued vectors as latent representations of the entities in the network where\ndepending on the model, either the distance, inner product, or weighted combination of the vectors\ncorresponding to two entities affects the likelihood of there being a link between them. However,\nextending our high school student example, we might hope that instead of having arbitrary real-\nvalued features (which are still useful for visualization), we would infer binary features where each\nfeature could correspond to an attribute like \u201cmale\u201d or \u201cathlete.\u201d Continuing our earlier example, if\nwe had a limited amount of data, we might not pick up on a feature like \u201cathlete.\u201d However, as we\nobserve more interactions, this could emerge as a clear feature. Instead of doubling the numbers of\nclasses in our model, we simply add an additional feature. 
Determining the number of features will\ntherefore be of extreme importance.\nIn this paper, we present the nonparametric latent feature relational model, a Bayesian nonparametric\nmodel in which each entity has binary-valued latent features that influence its relations. In\naddition, the relations depend on a set of known covariates. This model allows us to infer\nhow many latent features there are while at the same time inferring which features each entity\nhas and how those features influence the observations. This model is strictly more expressive than\nthe stochastic blockmodel. In Section 2, we describe a simplified version of our model and then\nthe full model. In Section 3, we discuss how to perform inference. In Section 4, we illustrate the\nproperties of our model using synthetic data and then show that the greater expressiveness of the\nlatent feature representation results in improved link prediction on three real datasets. Finally, we\nconclude in Section 5.\n\n2 The nonparametric latent feature relational model\n\nAssume we observe the directed relational links between a set of N entities. Let Y be the N \u00d7 N\nbinary matrix that contains these links. That is, let yij \u2261 Y(i, j) = 1 if we observe a link from\nentity i to entity j in that relation and yij = 0 if we observe that there is not a link. Unobserved\nlinks are left unfilled. Our goal will be to learn a model from the observed links such that we can\npredict the values of the unfilled entries.\n\n2.1 Basic model\n\nIn our basic model, each entity is described by a set of binary features. We are not given these\nfeatures a priori and will attempt to infer them. 
We assume that the probability of having a link\nfrom one entity to another is entirely determined by the combined effect of all pairwise feature\ninteractions. If there are K features, then let Z be the N \u00d7 K binary matrix where each row\ncorresponds to an entity and each column corresponds to a feature such that zik \u2261 Z(i, k) = 1 if the\nith entity has feature k and zik = 0 otherwise, and let Zi denote the feature vector corresponding to\nentity i. Let W be a K \u00d7 K real-valued weight matrix where wkk' \u2261 W(k, k') is the weight that\naffects the probability of there being a link from entity i to entity j if entity i has feature k and\nentity j has feature k'.\nWe assume that links are independent conditioned on Z and W, and that only the features of entities\ni and j influence the probability of a link between those entities. This defines the likelihood\n\n$$\\Pr(Y \\mid Z, W) = \\prod_{i,j} \\Pr(y_{ij} \\mid Z_i, Z_j, W) \\qquad (1)$$\n\nwhere the product ranges over all pairs of entities. Given the feature matrix Z and weight matrix W,\nthe probability that there is a link from entity i to entity j is\n\n$$\\Pr(y_{ij} = 1 \\mid Z, W) = \\sigma\\left(Z_i W Z_j^\\top\\right) = \\sigma\\Big(\\sum_{k,k'} z_{ik} z_{jk'} w_{kk'}\\Big) \\qquad (2)$$\n\nwhere \u03c3(\u00b7) is a function that transforms values on (\u2212\u221e, \u221e) to (0, 1) such as the sigmoid function\n\u03c3(x) = 1/(1 + exp(\u2212x)) or the probit function \u03c3(x) = \u03a6(x). An important aspect of this model is that\nall-zero columns of Z do not affect the likelihood. We will take advantage of this in Section 2.2.\nThis model is very flexible. With a single feature per entity, it is equivalent to a stochastic blockmodel.\nHowever, since entities can have more than a single feature, the model is more expressive. 
In\nthe high school student example, each feature can correspond to an attribute like \u201cmale,\u201d \u201cmusician,\u201d\nand \u201cathlete.\u201d If we were looking at the relation \u201cfriend of\u201d (not necessarily symmetric!), then the\nweight at the (athlete, musician) entry of W would correspond to the weight that an athlete would be\na friend of a musician. A positive weight would correspond to an increased probability, a negative\nweight a decreased probability, and a zero weight would indicate that there is no correlation between\nthose two features and the observed relation. The more positively correlated features people have,\nthe more likely they are to be friends. Another advantage of this representation is that if our data\ncontained observations of students in two distant locations, we could have a geographic feature for\nthe different locations. While other features such as \u201cathlete\u201d or \u201cmusician\u201d might indicate that one\nperson could be a friend of another, the geographic features could have extremely negative weights\nso that people who live far from each other are less likely to be friends. However, the parameters\nfor the non-geographic features would still be tied for all people, allowing us to make stronger\ninferences about how they influence the relations. Class-based models would need an abundance of\nclasses to capture these effects and would not have the same kind of parameter sharing.\nGiven the full set of observations Y, we wish to infer the posterior distribution of the feature matrix\nZ and the weights W. We do this using Bayes\u2019 theorem, $p(Z, W \\mid Y) \\propto p(Y \\mid Z, W)\\, p(Z)\\, p(W)$,\nwhere we have placed independent priors on Z and W. Without any prior knowledge about the\nfeatures or their weights, a natural prior for W involves placing an independent $\\mathcal{N}(0, \\sigma_w^2)$ prior on\neach $w_{kk'}$. However, placing a prior on Z is more challenging. 
If we knew how many features there\nwere, we could place an arbitrary parametric prior on Z. However, we wish to have a flexible prior\nthat allows us to infer the number of features at the same time as we infer all the entries\nin Z. The Indian Buffet Process is such a prior.\n\n2.2 The Indian Buffet Process and the basic generative model\n\nAs mentioned in the previous section, any features which are all-zero do not affect the likelihood.\nThat means that even if we added an infinite number of all-zero features, the likelihood would remain\nthe same. The Indian Buffet Process (IBP) [12] is a prior on infinite binary matrices such that with\nprobability one, a feature matrix drawn from it for a finite number of entities will only have a finite\nnumber of non-zero features. Moreover, any feature matrix, no matter how many non-zero features\nit contains, has positive probability under the IBP prior. It is therefore a useful nonparametric prior\nto place on our latent feature matrix Z.\nThe generative process to sample matrices from the IBP can be described through a culinary\nmetaphor that gave the IBP its name. In this metaphor, each row of Z corresponds to a diner at an\nIndian buffet and each column corresponds to a dish at the infinitely long buffet. If a customer takes\na particular dish, then the entry that corresponds to the customer\u2019s row and the dish\u2019s column is a one\nand the entry is zero otherwise. The culinary metaphor describes how people choose the dishes. In\nthe IBP, the first customer chooses a Poisson(\u03b1) number of dishes to sample, where \u03b1 is a parameter\nof the IBP. The ith customer tries each previously sampled dish with probability proportional to the\nnumber of people that have already tried the dish and then samples a Poisson(\u03b1/i) number of new\ndishes. 
This process is exchangeable, which means that the order in which the customers enter the\nrestaurant does not affect the configuration of the dishes that people try (up to permutations of the\ndishes as described in [12]). This insight leads to a straightforward Gibbs sampler to do posterior\ninference that we describe in Section 3.\nUsing an IBP prior on Z, our basic generative latent feature relational model is:\n\n$$Z \\sim \\mathrm{IBP}(\\alpha)$$\n$$w_{kk'} \\sim \\mathcal{N}(0, \\sigma_w^2) \\quad \\text{for all } k, k' \\text{ for which features } k \\text{ and } k' \\text{ are non-zero}$$\n$$y_{ij} \\sim \\sigma\\left(Z_i W Z_j^\\top\\right) \\quad \\text{for each observation.}$$\n\n2.3 Full nonparametric latent feature relational model\n\nWe have described the basic nonparametric latent feature relational model. We now combine it\nwith ideas from the social network community to get our full model. First, we note that there are\nmany instances of logit models used in statistical network analysis that make use of covariates in\nlink prediction [2]. Here we will focus on a subset of ideas discussed in [10]. Let Xij be a vector\nthat influences the relation yij, let Xp,i be a vector of known attributes of entity i when it is the\nparent of a link, and let Xc,i be a vector of known attributes of entity i when it is a child of a link.\nFor example, in Section 4.2, when Y represents relationships amongst countries, Xij is a scalar\nrepresenting the geographic similarity between countries (Xij = exp(\u2212d(i, j))) since this could\ninfluence the relationships and Xp,i = Xc,i is a set of known features associated with each country\n(Xp,i and Xc,i would be distinct if we had covariates specific to each country\u2019s roles). 
We then let\nc be a normally distributed scalar and \u03b2, \u03b2p, \u03b2c, a, and b be normally distributed vectors in our full\nmodel in which\n\n$$\\Pr(y_{ij} = 1 \\mid Z, W, X, \\beta, a, b, c) = \\sigma\\left(Z_i W Z_j^\\top + \\beta^\\top X_{ij} + (\\beta_p^\\top X_{p,i} + a_i) + (\\beta_c^\\top X_{c,j} + b_j) + c\\right). \\qquad (3)$$\n\nIf we do not have information about one or all of X, Xp, and Xc, we drop the corresponding term(s).\nIn this model, c is a global offset that affects the default likelihood of a relation and ai and bj are\nentity- and role-specific offsets.\nSo far, we have only considered the case of observing a single relation. It is not uncommon to\nobserve multiple relations for the same set of entities. For example, in addition to the \u201cfriend of\u201d\nrelation, we might also observe the \u201cadmires\u201d and \u201ccollaborates with\u201d relations. We still believe that\neach entity has a single set of features that determines all its relations, but these features will not\naffect each relation in the same way. If we are given m relations, label them $Y^1, Y^2, \\ldots, Y^m$. We\nwill use the same features for each relation, but we will use an independent weight matrix $W^i$ for\neach relation $Y^i$. In addition, covariates might be relation-specific or common across all relations.\nRegardless, they will interact in different ways in each relation. Our full model is now\n\n$$\\Pr(Y^1, \\ldots, Y^m \\mid Z, \\{W^i, X^i, \\beta^i, a^i, b^i, c^i\\}_{i=1}^m) = \\prod_{i=1}^m \\Pr(Y^i \\mid Z, W^i, X^i, \\beta^i, a^i, b^i, c^i).$$\n\n2.4 Variations of the nonparametric latent feature relational model\n\nThe model that we have defined is for directed graphs in which the matrix $Y^i$ is not assumed to be\nsymmetric. For undirected graphs, we would like to define a symmetric model. This is easy to do by\nrestricting $W^i$ to be symmetric. 
If we further believe that the features we learn should not interact,\nwe can assume that $W^i$ is diagonal.\n\n2.5 Related nonparametric latent feature models\n\nThere are two models related to our nonparametric latent feature relational model that both use the\nIBP as a prior on binary latent feature matrices. The most closely related model is the Binary Matrix\nFactorization (BMF) model of [13]. The BMF is a general model with several concrete variants,\nthe most relevant of which was used to predict unobserved entries of binary matrices for image\nreconstruction and collaborative filtering. If Y is the observed part of a binary matrix, then in this\nvariant, we assume that $Y \\mid U, V, W \\sim \\sigma(U W V^\\top)$ where \u03c3(\u00b7) is the logistic function, U and V are\nindependent binary matrices drawn from the IBP, and the entries in W are independent draws from a\nnormal distribution. If Y is an N \u00d7 N matrix where we assume the rows and columns have the same\nfeatures (i.e., U = V), then this special case of their model is equivalent to our basic (covariate-free)\nmodel. While [13] were interested in a more general formalization that is applicable to other tasks,\nwe have specialized and extended this model for the task of link prediction. The other related model\nis the ADCLUS model [14]. This model assumes we are given a symmetric matrix of nonnegative\nsimilarities Y and that $Y = Z W Z^\\top + \\epsilon$ where Z is drawn from the IBP, W is a diagonal matrix\nwith entries independently drawn from a Gamma distribution, and $\\epsilon$ is independent Gaussian noise.\nThis model does not allow for arbitrary feature interactions nor does it allow for negative feature\ncorrelations.\n\n3 Inference\n\nExact inference in our nonparametric latent feature relational model is intractable [12]. However,\nthe IBP prior lends itself nicely to approximate inference via Markov Chain Monte Carlo [15]. 
We\nfirst describe inference in the single-relation basic model, later extending it to the full model. In our\nbasic model, we must do posterior inference on Z and W. Since with probability one, any sample\nof Z will have a finite number of non-zero entries, we can store just the non-zero columns of each\nsample of the infinite binary matrix Z. Since we do not have a conjugate prior on W, we must also\nsample the corresponding entries of W. Our sampler is as follows:\n\nGiven W, resample Z. We do this by resampling each row Zi in succession. When sampling\nentries in the ith row, we use the fact that the IBP is exchangeable to assume that the ith customer in\nthe IBP was the last one to enter the buffet. Therefore, when resampling zik for non-zero columns\nk, if mk is the number of non-zero entries in column k excluding row i, then\n\n$$\\Pr(z_{ik} = 1 \\mid Z_{-ik}, W, Y) \\propto m_k \\Pr(Y \\mid z_{ik} = 1, Z_{-ik}, W).$$\n\nWe must also sample zik for each of the infinitely many all-zero columns to add features to the\nrepresentation. Here, we use the fact that in the IBP, the prior distribution on the number of new\nfeatures for the last customer is Poisson(\u03b1/N). As described in [12], we must then weight this\nby the likelihood term for having that many new features, computing this for 0, 1, ..., kmax new\nfeatures for some maximum number of new features kmax and sampling the number of new features\nfrom this normalized distribution. The main difficulty arises because we have not sampled the values\nof W for the all-zero columns and we do not have a conjugate prior on W, so we cannot compute\nthe likelihood term exactly. 
We can adapt one of the non-conjugate sampling approaches from the\nDirichlet process [16] to this task or use the suggestion in [13] to include a Metropolis-Hastings step\nto propose and either accept or reject some number of new columns and the corresponding weights.\nWe chose to use a stochastic Monte Carlo approximation of the likelihood. Once the number of new\nfeatures is sampled, we must sample the new values in W as described below.\n\nGiven Z, resample W. We sequentially resample each of the weights in W that correspond to\nnon-zero features and drop all weights that correspond to all-zero features. Since we do not have\na conjugate prior on W, we cannot directly sample W from its posterior. If \u03c3(\u00b7) is the probit, we\nadapt the auxiliary sampling trick from [17] to have a Gibbs sampler for the entries of W. If \u03c3(\u00b7) is\nthe logistic function, no such trick exists and we resort to using a Metropolis-Hastings step for each\nweight in which we propose a new weight from a normal distribution centered around the old one.\n\nHyperparameters. We can also place conjugate priors on the hyperparameters \u03b1 and \u03c3w and perform\nposterior inference on them. We use the approach from [18] for sampling of \u03b1.\n\nFigure 1: Features and corresponding observations for synthetic data. In (a), we show features that\ncould be explained by a latent-class model that then produces the observation matrix in (b). White\nindicates one values, black indicates zero values, and gray indicates held out values. In (c), we show\nthe feature matrix of our other synthetic dataset along with the corresponding observations in (d).\n(e) shows the feature matrix of a randomly chosen sample from our Gibbs sampler.\n\nMultiple relations. In the case of multiple relations, we can sample $W^i$ given Z independently for\neach i as above. However, when we resample Z, we must compute\n\n$$\\Pr(z_{ik} = 1 \\mid Z_{-ik}, \\{W^i, Y^i\\}_{i=1}^m) \\propto m_k \\prod_{i=1}^m \\Pr(Y^i \\mid z_{ik} = 1, Z_{-ik}, W^i).$$\n\nFull model. In the full model, we must also update $\\{\\beta^i, \\beta_p^i, \\beta_c^i, a^i, b^i, c^i\\}_{i=1}^m$. By conditioning on\nthese, the update equations for Z and $W^i$ take the same form, but with Equation (3) used for the\nlikelihood. When we condition on Z and $W^i$, the posterior updates for $(\\beta^i, \\beta_p^i, \\beta_c^i, a^i, b^i, c^i)$ are\nindependent and can be derived from the updates in [10].\n\nImplementation details. Despite the ease of writing down the sampler, samplers for the IBP often\nmix slowly due to the extremely large state space full of local optima. Even if we limited Z to have\nK columns, there are $2^{NK}$ potential feature matrices. In an effort to explore the space better, we can\naugment the Gibbs sampler for Z by introducing split-merge style moves as described in [13] as well\nas perform annealing or tempering to smooth out the likelihood. However, we found that the most\nsignificant improvement came from using a good initialization. A key insight that was mentioned\nin Section 2.1 is that the stochastic blockmodel is a special case of our model in which each entity\nonly has a single feature. Stochastic blockmodels have been shown to perform well for statistical\nnetwork analysis, so they seem like a reasonable way to initialize the feature matrix. In the results\nsection, we compare the performance of a random initialization to one in which Z is initialized with\na matrix learned by the Infinite Relational Model (IRM). 
To get our initialization point, we ran the\nGibbs sampler for the IRM for only 15 iterations and used the resulting class assignments to seed Z.\n\n4 Results\n\nWe first qualitatively analyze the strengths and weaknesses of our model on synthetic data,\nestablishing what we can and cannot expect from it. We then compare our model against two class-based\ngenerative models, the Infinite Relational Model (IRM) [6] and the Mixed Membership Stochastic\nBlockmodel (MMSB) [8], on two datasets from the original IRM paper and a NIPS coauthorship\ndataset, establishing that our model does better than the best of those models on those datasets.\n\n4.1 Synthetic data\n\nWe first focus on the qualitative performance of our model. We applied the basic model to two very\nsimple synthetic datasets generated from known features. These datasets were simple enough that\nthe basic model could attain 100% accuracy on held-out data, but were different enough to address\nthe qualitative characteristics of the latent features inferred. In one dataset, the features were the\nclass-based features seen in Figure 1(a) and in the other, we used the features in Figure 1(c). The\nobservations derived from these features can be seen in Figure 1(b) and Figure 1(d), respectively.\nOn both datasets, we initialized Z and W randomly. With the very simple, class-based model, 50%\nof the sampled feature matrices were identical to the generating feature matrix with another 25%\ndiffering by a single bit. However, on the other dataset, only 25% of the samples were at most a\nsingle bit different from the true matrix. It is not the case that the other 75% of the samples were bad\nsamples, though. A randomly chosen sample of Z is shown in Figure 1(e). Though this matrix is\ndifferent from the true generating features, with the appropriate weight matrix it predicts just as well\nas the true feature matrix. 
These tests show that while our latent feature approach is able to learn\nfeatures that explain the data well, due to subtle interactions between sets of features and weights,\nthe features themselves will not in general correspond to interpretable features. However, we can\nexpect the inferred features to do a good job explaining the data. This also indicates that there are\nmany local optima in the feature space, further motivating the need for good initialization.\n\n4.2 Multi-relational datasets\n\nIn the original IRM paper, the IRM was applied to several datasets [6]. These include a dataset\ncontaining 54 relations of 14 countries (such as \u201cexports to\u201d and \u201cprotests\u201d) along with 90 given\nfeatures of the countries [19] and a dataset containing 26 kinship relationships of 104 people in the\nAlyawarra tribe in Central Australia [20]. See [6, 19, 20] for more details on the datasets.\nOur goal in applying the latent feature relational model to these datasets was to demonstrate the\neffectiveness of our algorithm when compared to two established class-based algorithms, the IRM\nand the MMSB, and to demonstrate the effectiveness of our full algorithm. For the Alyawarra\ndataset, we had no known covariates. For the countries dataset, Xp = Xc was the set of known\nfeatures of the countries and X was the country distance similarity matrix described in Section 2.3.\nAs mentioned in the synthetic data section, the inferred features do not necessarily have any inter-\npretable meaning, so we restrict ourselves to a quantitative comparison. For each dataset, we held\nout 20% of the data during training and we report the AUC, the area under the ROC (Receiver Oper-\nating Characteristic) curve, for the held-out data [21]. 
We report results for inferring a global set of\nfeatures for all relations as described in Section 2.3 which we refer to as \u201cglobal\u201d as well as results\nwhen a different set of features is independently learned for each relation and then the AUCs of all\nrelations are averaged together, which we refer to as \u201csingle.\u201d In addition, we tried initializing our\nsampler for the latent feature relational model with either a random feature matrix (\u201cLFRM rand\u201d)\nor class-based features from the IRM (\u201cLFRM w/ IRM\u201d). We ran our sampler for 1000 iterations for\neach configuration using a logistic squashing function (though results using the probit are similar),\nthrowing out the first 200 samples as burn-in. Each method was given five random restarts.\n\nTable 1: AUC on the countries and kinship datasets. Bold identifies the best performance.\n\nMethod | Countries single | Countries global | Alyawarra single | Alyawarra global\nLFRM w/ IRM | 0.8521 \u00b1 0.0035 | 0.8772 \u00b1 0.0075 | 0.9346 \u00b1 0.0013 | 0.9183 \u00b1 0.0108\nLFRM rand | 0.8529 \u00b1 0.0037 | 0.7127 \u00b1 0.030 | 0.9443 \u00b1 0.0018 | 0.7067 \u00b1 0.0534\nIRM | 0.8423 \u00b1 0.0034 | 0.8500 \u00b1 0.0033 | 0.9310 \u00b1 0.0023 | 0.8943 \u00b1 0.0300\nMMSB | 0.8212 \u00b1 0.0032 | 0.8643 \u00b1 0.0077 | 0.9005 \u00b1 0.0022 | 0.9143 \u00b1 0.0097\n\nResults of these tests are in Table 1. As can be seen, the LFRM with class-based initialization\noutperforms both the IRM and MMSB. On the individual relations (\u201csingle\u201d), the LFRM with random\ninitialization also does well, beating the IRM initialization on both datasets. However, the random\ninitialization does poorly at inferring the global features due to the coupling of features and the\nweights for each of the relations. This highlights the importance of proper initialization. 
To demonstrate\nthat the covariates are helping, but that even without them, our model does well, we ran the\nglobal LFRM with class-based initialization without covariates on the countries dataset and the AUC\ndropped to 0.8713 \u00b1 0.0105, which is still the best performance.\nOn the countries data, the latent feature model inferred on average 5-7 features when seeded with\nthe IRM and 8-9 with a random initialization. On the kinship data, it inferred 9-11 features when\nseeded with the IRM and 13-19 when seeded randomly.\n\nFigure 2: Predictions for all algorithms on the NIPS coauthorship dataset. Panels: (a) true relations,\n(b) feature predictions, (c) IRM predictions, (d) MMSB predictions. In (a), a white entry\nmeans two people wrote a paper together. In (b-d), the lighter an entry, the more likely that algorithm\npredicted the corresponding people would interact.\n\n4.3 Predicting NIPS coauthorship\n\nAs our final example, highlighting the expressiveness of the latent feature relational model, we used\nthe coauthorship data from the NIPS dataset compiled in [22]. This dataset contains a list of all\npapers and authors from NIPS 1-17. We took the 234 authors who had published with the most\nother people and looked at their coauthorship information. The symmetric coauthor graph can be\nseen in Figure 2(a). We again learned models for the latent feature relational model, the IRM, and the\nMMSB, training on 80% of the data and using the remaining 20% as a test set. For the latent feature\nmodel, since the coauthorship relationship is symmetric, we learned a full, symmetric weight matrix\nW as described in Section 2.4. We did not use any covariates. A visualization of the predictions for\neach of these algorithms can be seen in Figure 2(b-d). Figure 2 makes the difference\nin expressiveness clear. Stochastic blockmodels are required to group authors into classes, and assume\nthat all members of classes interact similarly. 
For visualization, we have ordered the authors by\nthe groups the IRM found. These groups can clearly be seen in Figure 2(c). The MMSB, by\nallowing partial membership, is not as restrictive. However, on this dataset, the IRM outperformed\nit. The latent feature relational model is the most expressive of the models and is able to reproduce\nthe coauthorship network much more faithfully.\nThe latent feature relational model also quantitatively outperformed the IRM and MMSB. We again\nran our sampler for 1000 samples, initializing with either a random feature matrix or a class-based\nfeature matrix from the IRM, and reported the AUC on the held-out data. Using five restarts for each\nmethod, the LFRM w/ IRM performed best with an AUC of 0.9509 and the LFRM rand was next with\n0.9466, while the IRM at 0.8906 and the MMSB at 0.8705 were much lower (all at most \u00b10.013). On\naverage, the latent feature relational model inferred 20-22 features when initialized with the IRM\nand 38-44 features when initialized randomly.\n\n5 Conclusion\n\nWe have introduced the nonparametric latent feature relational model, an expressive nonparametric\nmodel for inferring latent binary features in relational entities. This model combines approaches\nfrom the statistical network analysis community, which has emphasized feature-based methods for\nanalyzing network data, with ideas from Bayesian nonparametrics in order to infer\nthe number of latent binary features at the same time as we infer the features of each entity and how\nthose features interact. Existing class-based approaches infer latent structure that is a special case\nof what can be inferred by this model. 
As a consequence, our model is strictly more expressive\nthan these approaches, and can use the solutions produced by these approaches for initialization.\nWe showed empirically that the nonparametric latent feature model performs well at link prediction\non several different datasets, including datasets that were originally used to argue for class-based\napproaches. The success of this model can be traced to its richer representations, which make it able\nto capture subtle patterns of interaction much better than class-based models.\n\nAcknowledgments KTM was supported by the U.S. Department of Energy contract DE-AC52-07NA27344\nthrough Lawrence Livermore National Laboratory. TLG was supported by grant number FA9550-07-1-0351\nfrom the Air Force Office of Scientific Research.\n\nReferences\n\n[1] Stanley Wasserman and Katherine Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.\n\n[2] Stanley Wasserman and Philippa Pattison. Logit models and logistic regressions for social networks: I. An introduction to Markov random graphs and p\u2217. Psychometrika, 61(3):401\u2013425, 1996.\n\n[3] Garry Robins, Tom Snijders, Peng Wang, Mark Handcock, and Philippa Pattison. Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29(2):192\u2013215, May 2007.\n\n[4] Yuchung J. Wang and George Y. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8\u201319, 1987.\n\n[5] Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077\u20131087, 2001.\n\n[6] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. 
Learning systems of concepts with an infinite relational model. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2006.\n\n[7] Zhao Xu, Volker Tresp, Kai Yu, and Hans-Peter Kriegel. Infinite hidden relational models. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI), 2006.\n\n[8] Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In D. Koller, Y. Bengio, D. Schuurmans, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS) 21. Red Hook, NY: Curran Associates, 2009.\n\n[9] Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090\u20131098, 2002.\n\n[10] Peter D. Hoff. Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100(469):286\u2013295, 2005.\n\n[11] Peter D. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2008.\n\n[12] Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Y. Weiss, B. Sch\u00f6lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems (NIPS) 18. Cambridge, MA: MIT Press, 2006.\n\n[13] Edward Meeds, Zoubin Ghahramani, Radford Neal, and Sam Roweis. Modeling dyadic data with binary latent factors. In B. Sch\u00f6lkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems (NIPS) 19. Cambridge, MA: MIT Press, 2007.\n\n[14] Daniel L. Navarro and Thomas L. Griffiths. Latent features in similarity judgment: A nonparametric Bayesian approach. Neural Computation, 20(11):2597\u20132628, 2008.\n\n[15] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. 
Springer, 2004.\n\n[16] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249\u2013265, 2000.\n\n[17] James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669\u2013679, 1993.\n\n[18] Dilan G\u00f6r\u00fcr, Frank J\u00e4kel, and Carl Edward Rasmussen. A choice model with infinitely many latent features. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.\n\n[19] Rudolph J. Rummel. Dimensionality of nations project: Attributes of nations and behavior of nation dyads, 1950\u20131965. ICPSR data file, 1999.\n\n[20] Woodrow W. Denham. The Detection of Patterns in Alyawarra Nonverbal Behavior. PhD thesis, University of Washington, 1973.\n\n[21] Jin Huang and Charles X. Ling. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3):299\u2013310, 2005.\n\n[22] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. The Journal of Machine Learning Research, 8:2265\u20132295, 2007.\n", "award": [], "sourceid": 960, "authors": [{"given_name": "Kurt", "family_name": "Miller", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}