{"title": "Reducing the Rank in Relational Factorization Models by Including Observable Patterns", "book": "Advances in Neural Information Processing Systems", "page_first": 1179, "page_last": 1187, "abstract": "Tensor factorizations have become popular methods for learning from multi-relational data. In this context, the rank of a factorization is an important parameter that determines runtime as well as generalization ability. To determine conditions under which factorization is an efficient approach for learning from relational data, we derive upper and lower bounds on the rank required to recover adjacency tensors. Based on our findings, we propose a novel additive tensor factorization model for learning from latent and observable patterns in multi-relational data and present a scalable algorithm for computing the factorization. Experimentally, we show that the proposed approach does not only improve the predictive performance over pure latent variable methods but that it also reduces the required rank --- and therefore runtime and memory complexity --- significantly.", "full_text": "Reducing the Rank of Relational Factorization\n\nModels by Including Observable Patterns\n\nMaximilian Nickel1,2\n\nXueyan Jiang3,4\n\nVolker Tresp3,4\n\n1LCSL, Poggio Lab, Massachusetts Institute of Technology, Cambridge, MA, USA\n\n2Istituto Italiano di Tecnologia, Genova, Italy\n\n3Ludwig Maximilian University, Munich, Germany\n\n4Siemens AG, Corporate Technology, Munich, Germany\n\nmnick@mit.edu, {xueyan.jiang.ext,volker.tresp}@siemens.com\n\nAbstract\n\nTensor factorization has become a popular method for learning from multi-\nrelational data. In this context, the rank of the factorization is an important parame-\nter that determines runtime as well as generalization ability. 
To identify conditions under which factorization is an efficient approach for learning from relational data, we derive upper and lower bounds on the rank required to recover adjacency tensors. Based on our findings, we propose a novel additive tensor factorization model to learn from latent and observable patterns in multi-relational data and present a scalable algorithm for computing the factorization. We show experimentally that the proposed additive model not only improves the predictive performance over pure latent variable methods but also reduces the required rank — and therefore runtime and memory complexity — significantly.

1 Introduction

Relational and graph-structured data has become ubiquitous in many fields of application such as social network analysis, bioinformatics, and artificial intelligence. Moreover, relational data is generated in unprecedented amounts in projects like the Semantic Web, YAGO [27], NELL [4], and Google's Knowledge Graph [5], such that learning from relational data, and in particular learning from large-scale relational data, has become an important subfield of machine learning. Existing approaches to relational learning can roughly be divided into two groups: first, methods that explain relationships via observable variables, i.e. via the observed relationships and attributes of entities; and second, methods that explain relationships via a set of latent variables. The objective of latent variable models is to infer the states of these hidden variables which, once known, permit the prediction of unknown relationships. Methods for learning from observable variables cover a wide range of approaches, e.g. 
inductive logic programming methods such as FOIL [23], statistical relational learning methods such as Probabilistic Relational Models [6] and Markov Logic Networks [24], and link prediction heuristics based on the Jaccard Coefficient and the Katz Centrality [16]. Important examples of latent variable models for relational data include the IHRM and the IRM [29, 10], the Mixed Membership Stochastic Blockmodel [1], and low-rank matrix factorizations [16, 26, 7]. More recently, tensor factorization, a generalization of matrix factorization to higher-order data, has shown state-of-the-art results for relationship prediction on multi-relational data [21, 8, 2, 13]. The number of latent variables in tensor factorization is determined via the number of latent components used in the factorization, which in turn is bounded by the factorization rank. While tensor and matrix factorization algorithms typically scale well with the size of the data — which is one reason for their appeal — they often do not scale well with respect to the rank of the factorization. For instance, RESCAL is a state-of-the-art relational learning method based on tensor factorization which can be applied to large knowledge bases consisting of millions of entities and billions of known facts [22].1 However, while the runtime of the most scalable known algorithm to compute RESCAL scales linearly with the number of entities, the number of relations, and the number of known facts, it scales cubically with regard to the rank of the factorization [22]. Moreover, the memory requirements of tensor factorizations like RESCAL quickly become infeasible on large data sets if the factorization rank is large and no additional sparsity of the factors is enforced. Hence, tensor (and matrix) rank is a central parameter of factorization methods that determines generalization ability as well as scalability. 
In this paper we therefore study how the rank of factorization methods can be reduced while maintaining their predictive performance and scalability. We first analyze under which conditions tensor and matrix factorization requires high or low rank on relational data. Based on our findings, we then propose an additive tensor decomposition approach to reduce the required rank of the factorization by combining latent and observable variable approaches.

This paper is organized as follows: In section 2 we develop the main theoretical results of this paper, where we show that the rank of an adjacency tensor is lower bounded by the maximum number of strongly connected components of a single relation and upper bounded by the sum of the diclique partition numbers of all relations. Based on our theoretical results, we propose in section 3 a novel tensor decomposition approach for multi-relational data and present a scalable algorithm to compute the decomposition. In section 4 we evaluate our model on various multi-relational datasets.

Preliminaries We will model relational data as a directed graph (digraph), i.e. as an ordered pair $\Gamma = (V, E)$ of a nonempty set of vertices $V$ and a set of directed edges $E \subseteq V \times V$. An existing edge between nodes $v_i$ and $v_j$ will be denoted by $v_i \to v_j$. By a slight abuse of notation, $\Gamma(Y)$ will indicate the digraph $\Gamma$ associated with an adjacency matrix $Y \in \{0,1\}^{N \times N}$. Next, we will briefly review further concepts of tensor and graph theory that are important for the course of this paper.

Definition 1. A strongly connected component of a digraph $\Gamma$ is a maximal subgraph $\Psi$ for which every vertex is reachable from any other vertex in $\Psi$ by following the directed edges in the subgraph. A strongly connected component is trivial if it consists only of a single element, i.e. 
if it is of the form $\Psi = (\{v_i\}, \emptyset)$, and nontrivial otherwise.

We will denote the number of strongly connected components in a digraph $\Gamma$ by $\mathrm{scc}(\Gamma)$. The number of nontrivial strongly connected components will be denoted by $\mathrm{scc}^+(\Gamma)$.

Definition 2. A digraph $\Gamma = (V, E)$ is a diclique if it is an orientation of a complete undirected bipartite graph with bipartition $(V_1, V_2)$ such that $v_1 \in V_1$ and $v_2 \in V_2$ for every edge $v_1 \to v_2 \in E$.

Figure 3 in supplementary material A shows an example of a diclique. Please note that dicliques consist only of trivial strongly connected components, as there cannot exist any cycles in a diclique. Given the concept of a diclique, the diclique partition number of a digraph is defined as:

Definition 3. The diclique partition number $\mathrm{dp}(\Gamma)$ of a digraph $\Gamma = (V, E)$ is the minimum number of dicliques such that each edge $e \in E$ is contained in exactly one diclique.

Tensors can be regarded as higher-order generalizations of vectors and matrices. In the following, we will only consider third-order tensors of the form $X \in \mathbb{R}^{I \times J \times K}$, although many concepts generalize to higher-order tensors. The mode-$n$ unfolding (or matricization) of $X$ arranges the mode-$n$ fibers of $X$ as the columns of a newly formed matrix and will be denoted by $X_{(n)}$. The tensor-matrix product $A = X \times_n B$ multiplies the tensor $X$ with the matrix $B$ along the $n$-th mode of $X$, such that $A_{(n)} = B X_{(n)}$. For a detailed introduction to tensors and these operations we refer the reader to Kolda et al. [12]. The $k$-th frontal slice of a third-order tensor $X \in \mathbb{R}^{I \times J \times K}$ will be denoted by $X_k \in \mathbb{R}^{I \times J}$. The outer product of vectors will be denoted by $a \circ b$. In contrast to matrices, there exist two non-equivalent notions of the rank of a tensor:

Definition 4. Let $X \in \mathbb{R}^{I \times J \times K}$ be a third-order tensor. 
The tensor rank $\text{t-rank}(X)$ of $X$ is defined as $\text{t-rank}(X) = \min\{r \mid X = \sum_{i=1}^{r} a_i \circ b_i \circ c_i\}$, where $a_i \in \mathbb{R}^I$, $b_i \in \mathbb{R}^J$, and $c_i \in \mathbb{R}^K$. The multilinear rank $\text{n-rank}(X)$ of $X$ is defined as the tuple $(r_1, r_2, r_3)$, where $r_i = \operatorname{rank}(X_{(i)})$.

To model multi-relational data as tensors, we use the following concept of an adjacency tensor:

Definition 5. Let $G = \{(V, E_k)\}_{k=1}^{K}$ be a set of digraphs over the same set of vertices $V$, where $|V| = N$. The adjacency tensor of $G$ is a third-order tensor $X \in \{0,1\}^{N \times N \times K}$ with entries $x_{ijk} = 1$ if $v_i \to v_j \in E_k$ and $x_{ijk} = 0$ otherwise.

1 Similar results can be obtained for state-of-the-art algorithms to compute the well-known CP and Tucker decompositions. Please see the supplementary material A.3 for the respective derivations.

For a single digraph, an adjacency tensor is equivalent to the digraph's adjacency matrix. Note that $K$ corresponds to the number of relation types in a domain.

2 On the Algebraic Complexity of Graph-Structured Data

In this section, we want to identify conditions under which tensor factorization can be considered efficient for relational learning. Let $X$ denote an observed adjacency tensor with missing or noisy entries from which we seek to recover the true adjacency tensor $Y$. Rank affects both the predictive and the runtime performance of a factorization: a high factorization rank will lead to poor runtime performance, while a low factorization rank might not be sufficient to model $Y$. We are therefore interested in identifying upper and lower bounds on the minimal rank — either tensor rank or multilinear rank — that is required such that a factorization can model the true adjacency tensor $Y$. 
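Definitions 4 and 5 can be made concrete with a short numerical sketch (assuming Python with numpy; the two toy relations below are hypothetical, not from the paper): it builds a small adjacency tensor, forms a mode-3 unfolding, and computes the matrix rank of each frontal slice.

```python
import numpy as np

# Sketch: adjacency tensor of Definition 5 for K = 2 toy relations over
# N = 4 vertices, plus the quantities appearing in Definition 4.
N, K = 4, 2
edges = {0: [(0, 1), (1, 2), (2, 0)],   # relation 0: a directed 3-cycle
         1: [(0, 2), (1, 2)]}           # relation 1: two edges sharing a target

X = np.zeros((N, N, K))
for k, edge_list in edges.items():
    for i, j in edge_list:
        X[i, j, k] = 1.0                # x_ijk = 1 iff v_i -> v_j in relation k

# Mode-3 unfolding: one row per frontal slice (the column ordering is a
# convention and does not change the rank).
X3 = X.transpose(2, 0, 1).reshape(K, N * N)
r3 = np.linalg.matrix_rank(X3)          # third component of the multilinear rank

# Matrix rank of each frontal slice Y_k.
slice_ranks = [np.linalg.matrix_rank(X[:, :, k]) for k in range(K)]
print(r3, slice_ranks)
```

Here the 3-cycle slice has rank 3 while the two parallel edges give a rank-1 slice, illustrating how the slice ranks can differ per relation.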
Please note that we are not concerned with bounds on the generalization error or the sample complexity that is needed to learn a good model, but with bounds on the algebraic complexity that is needed to express the true underlying data via factorizations. For sign-matrices $Y \in \{\pm 1\}^{N \times N}$, this question has been discussed in combinatorics and communication complexity via the sign-rank $\operatorname{rank}_{\pm}(Y)$, which is the minimal rank needed to recover the sign-pattern of $Y$:

$$\operatorname{rank}_{\pm}(Y) = \min_{M \in \mathbb{R}^{N \times N}} \left\{ \operatorname{rank}(M) \mid \forall i,j : \operatorname{sgn}(m_{ij}) = y_{ij} \right\}. \qquad (1)$$

Although the concept of sign-rank can be extended to adjacency tensors, bounds based on the sign-rank would be of only limited significance for our purpose, as no practical algorithms exist to find the solution to equation (1). Instead, we provide upper and lower bounds on tensor and multilinear rank, i.e. bounds on the exact recovery of $Y$, for the following reasons: It follows immediately from (1) that any upper bound on $\operatorname{rank}(Y)$ will also hold for $\operatorname{rank}_{\pm}(Y)$, since $\operatorname{rank}_{\pm}(Y) \leq \operatorname{rank}(Y)$. Upper bounds on $\operatorname{rank}(Y)$ can therefore provide insight into the conditions under which factorizations can be efficient on relational data — regardless of whether we seek to recover exact values or sign patterns. Lower bounds on $\operatorname{rank}(Y)$ provide insight into the conditions under which the exact recovery of $Y$ can be inefficient. Furthermore, it can be observed empirically that lower bounds on the rank are more informative for existing factorization approaches to relational learning like [21, 13, 16] than bounds on sign-rank. For instance, let $S_n = 2I_n - J_n$ be the "signed identity matrix" of size $n$, where $I_n$ denotes the $n \times n$ identity matrix and $J_n$ denotes the $n \times n$ matrix of all ones. 
While it is known that $\operatorname{rank}_{\pm}(S_n) = O(1)$ for any size $n$ [17], it can be checked empirically that SVD requires a rank larger than $n/2$, i.e. a rank of $O(n)$, to recover the sign pattern of $S_n$.

Based on these considerations, we now state the main theorem of this paper, which bounds the different notions of the rank of an adjacency tensor by the diclique partition numbers and the numbers of strongly connected components of the involved relations:

Theorem 1. The tensor rank $\text{t-rank}(Y)$ and the multilinear rank $\text{n-rank}(Y) = (r_1, r_2, r_3)$ of any adjacency tensor $Y \in \{0,1\}^{N \times N \times K}$ representing $K$ relations $\{\Gamma_k(Y_k)\}_{k=1}^{K}$ are bounded as

$$\sum_{k=1}^{K} \mathrm{dp}(\Gamma_k) \;\geq\; \theta \;\geq\; \max_k \mathrm{scc}^+(\Gamma_k),$$

where $\theta$ is any of the quantities $\text{t-rank}(Y)$, $r_1$, or $r_2$.

To prove theorem 1 we will first derive upper and lower bounds for adjacency matrices and then show how these bounds generalize to adjacency tensors.

Lemma 1. For any adjacency matrix $Y \in \{0,1\}^{N \times N}$ it holds that $\mathrm{dp}(\Gamma) \geq \operatorname{rank}(Y) \geq \mathrm{scc}^+(\Gamma)$.

Proof. The upper bound of lemma 1 follows directly from the fact that $\mathrm{dp}(\Gamma(Y)) = \operatorname{rank}_{\mathbb{N}}(Y)$ and the fact that $\operatorname{rank}_{\mathbb{N}}(Y) \geq \operatorname{rank}(Y)$, where $\operatorname{rank}_{\mathbb{N}}(Y)$ denotes the non-negative integer rank of the binary matrix $Y$ [19, see eq. 1.6.5 and eq. 1.7.1]. □

Next we will prove the lower bound of lemma 1. Let $\lambda_i(Y)$ denote the $i$-th (complex) eigenvalue of $Y$ and let $\Lambda(Y)$ denote the spectrum of $Y \in \mathbb{R}^{N \times N}$, i.e. the multiset of (complex) eigenvalues of $Y$. Furthermore, let $\rho(Y) = \max_i |\lambda_i(Y)|$ be the spectral radius of $Y$. Now, recall the celebrated Perron-Frobenius theorem:

Theorem 2 ([25, Theorem 8.2]). 
Let $Y \in \mathbb{R}^{N \times N}$ with $y_{ij} \geq 0$ be a non-negative irreducible matrix. Then $\rho(Y) > 0$ is a simple eigenvalue of $Y$ associated with a positive eigenvector.

Please note that a nontrivial digraph is strongly connected iff its adjacency matrix is irreducible [3, Theorem 3.2.1]. Furthermore, an adjacency matrix is nilpotent iff the associated digraph is acyclic [3, Section 9.8]. Hence, the adjacency matrix of a strongly connected component $\Psi$ is nilpotent iff $\Psi$ is trivial. Given these considerations, we can now prove the lower bound of lemma 1:

Lemma 2. For any non-negative adjacency matrix $Y \in \mathbb{R}^{N \times N}$ with $y_{ij} \geq 0$ of a weighted digraph $\Gamma$ it holds that $\operatorname{rank}(Y) \geq \mathrm{scc}^+(\Gamma)$.

Proof. Let $\Gamma$ consist of $k$ nontrivial strongly connected components. The Frobenius normal form $B$ of its associated adjacency matrix $Y$ then consists of $k$ irreducible matrices $B_i$ on its block diagonal. It follows from theorem 2 that each irreducible $B_i$ has at least one nonzero eigenvalue. Since $B$ is block upper triangular, it also holds that $\Lambda(B) = \bigcup_{i=1}^{k} \Lambda(B_i)$. As the rank of a square matrix is larger than or equal to the number of its nonzero eigenvalues, it follows that $\operatorname{rank}(B) \geq k$. Lemma 2 follows from the fact that $B$ is similar to $Y$ and that matrix similarity preserves rank. □

So far, we have shown that $\operatorname{rank}(Y)$ of an adjacency matrix $Y$ is bounded by the diclique partition number and the number of nontrivial strongly connected components of the associated digraph. To complete the proof of theorem 1 we will now show that these bounds for uni-relational data translate directly to multi-relational data and to the different notions of the rank of an adjacency tensor. In particular, we will show that both notions of tensor rank are lower bounded by the maximum rank of a single frontal slice of the tensor and upper bounded by the sum of the ranks of all frontal slices:

Lemma 3. 
The tensor rank $\text{t-rank}(Y)$ and the multilinear rank $\text{n-rank}(Y) = (r_1, r_2, r_3)$ of any third-order tensor $Y \in \mathbb{R}^{I \times J \times K}$ with frontal slices $Y_k$ are bounded as

$$\sum_{k=1}^{K} \operatorname{rank}(Y_k) \;\geq\; \theta \;\geq\; \max_k \operatorname{rank}(Y_k),$$

where $\theta$ is any of the quantities $\text{t-rank}(Y)$, $r_1$, or $r_2$.

Proof. Due to space constraints, we include only the proof for tensor rank. The proof for multilinear rank can be found in supplementary material A.1. Let $\text{t-rank}(Y) = r$. It can be seen from the definition of tensor rank that $Y_k = \sum_{i=1}^{r} c_{ki} (a_i b_i^\top)$. Consequently, it follows from the subadditivity of matrix rank, i.e. $\operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B)$, that

$$\operatorname{rank}(Y_k) \leq \sum_{i=1}^{r} \operatorname{rank}(c_{ki}\, a_i b_i^\top) \leq r,$$

where the last inequality follows from $\operatorname{rank}(c_{ki}\, a_i b_i^\top) \leq 1$. Now we derive the upper bound of lemma 3 by providing a decomposition of $Y$ with rank $r = \sum_k \operatorname{rank}(Y_k)$ that recovers $Y$ exactly. Let $Y_k = U_k S_k V_k^\top$ be the SVD of $Y_k$ with $S_k = \operatorname{diag}(s_k)$. Furthermore, let $U = [U_1\, U_2\, \cdots\, U_K]$, $V = [V_1\, V_2\, \cdots\, V_K]$, and let $S$ be a block-diagonal matrix where the $i$-th block on the diagonal is equal to $s_i^\top$. It can be easily verified that $\sum_{i=1}^{r} \hat{u}_i \circ \hat{v}_i \circ \hat{s}_i$ provides an exact decomposition of $Y$, where $r = \sum_k \operatorname{rank}(Y_k)$ and $\hat{u}_i$, $\hat{v}_i$, and $\hat{s}_i$ are the $i$-th columns of the matrices $U$, $V$, and $S$. The inequality in lemma 3 follows since $r$ is not necessarily minimal. □

Theorem 1 now follows by combining lemmas 1 and 3, which concludes the proof.

Discussion It can be seen from theorem 1 that factorizations can be computationally efficient when $\sum_k \mathrm{dp}(\Gamma_k)$ is small. 
However, factorizations can potentially be inefficient when $\mathrm{scc}^+(\Gamma_k)$ is large for any $\Gamma_k$ in the data. For instance, consider an idealized marriedTo relation, where each person is married to exactly one person. Evidently, for $m$ marriages, the associated digraph would consist of $m$ strongly connected components, i.e. one component for each marriage. According to lemma 2, a factorization model would require at least $m$ latent components to recover this adjacency matrix exactly. Consequently, an algorithm with cubic runtime complexity in the rank would only be able to recover $Y$ for this relation when the number of marriages is small, which limits its applicability to such relations. A second important observation for multi-relational learning is that the lower bound in theorem 1 depends only on the largest rank of a single frontal slice (i.e. a single adjacency matrix) in $Y$. For multi-relational learning this means that regularities between different relations cannot decrease tensor or multilinear rank below the largest matrix rank of a single relation. For instance, consider an $N \times N \times 2$ tensor $Y$ where $Y_1 = Y_2$. Clearly it holds that $\operatorname{rank}(Y_{(3)}) = 1$, such that $Y_1$ could easily be predicted from $Y_2$ when $Y_2$ is known. However, theorem 1 states that the rank of the factorization must be at least $\operatorname{rank}(Y_1)$ — which can be arbitrarily large, up to $N$ — when the first two modes of $Y$ are also factorized. Please note that this is not a statement about sample complexity or generalization error, which can be reduced when factorizing all modes of a tensor, but a statement about the minimal rank that is required to express the data. A final observation from the previous discussion is that factorizations and observable variable methods excel at different aspects of relationship prediction. 
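The contrast drawn in this discussion can be checked numerically. Below is a small sketch (assuming Python with numpy; all sizes are arbitrary, not from the paper) comparing the matrix rank of a perfect-matching marriedTo relation with the rank of a single diclique.

```python
import numpy as np

# Sketch: rank of an idealized "marriedTo" matching relation versus the
# rank of a single diclique.
m = 50                       # number of marriages
N = 2 * m                    # number of persons

# Perfect matching: persons 2i and 2i+1 are married to each other, so every
# marriage forms a nontrivial strongly connected component of size two.
married = np.zeros((N, N))
for i in range(m):
    married[2 * i, 2 * i + 1] = 1.0
    married[2 * i + 1, 2 * i] = 1.0

# Lemma 2 guarantees rank >= scc+ = m; here the matrix is in fact a
# permutation matrix, so its rank is even N = 2m.
rank_married = np.linalg.matrix_rank(married)

# A diclique: every vertex of the first partition links to every vertex of
# the second. Its adjacency matrix is an outer product and has rank 1.
u = np.zeros(N); u[:5] = 1.0
v = np.zeros(N); v[5:10] = 1.0
diclique = np.outer(u, v)
rank_diclique = np.linalg.matrix_rank(diclique)

print(rank_married, rank_diclique)
```

Exact recovery of the matching matrix requires full rank (100 here, well above the lower bound of 50 from lemma 2), while the diclique is already captured at rank 1.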
For instance, predicting relationships in the idealized marriedTo relation can be done easily with Horn clauses and link prediction heuristics as listed in supplementary material A.2. In contrast, factorization methods would be inefficient in predicting links in this relation as they would require at least one latent component for each marriage. At the same time, links in a diclique of any size can trivially be modeled with a rank-2 factorization that indicates the partition memberships, while standard neighborhood-based methods will fail on dicliques since — by the definition of a diclique — there do not exist links within one partition, yet the only vertices that share neighbors are located in the same partition.

3 An Additive Relational Effects Model

RESCAL is a state-of-the-art relational learning method that is based on a constrained Tucker decomposition and as such is subject to bounds as in theorem 1. Motivated by the results of section 2, we propose an additive tensor decomposition approach that combines the strengths of latent and observable variable methods to reduce the rank requirements of RESCAL on multi-relational data. To include the information of observable pattern methods in the factorization, we augment the RESCAL model with an additive term that holds the predictions of observable pattern methods. In particular, let $X \in \{0,1\}^{N \times N \times K}$ be a third-order adjacency tensor and $M \in \mathbb{R}^{N \times N \times P}$ be a third-order tensor that holds the predictions of an arbitrary number of relational learning methods. The proposed additive relational effects model (ARE) decomposes $X$ into

$$X \approx R \times_1 A \times_2 A + M \times_3 W, \qquad (2)$$

where $A \in \mathbb{R}^{N \times r}$, $R \in \mathbb{R}^{r \times r \times K}$, and $W \in \mathbb{R}^{K \times P}$. The first term of equation (2) corresponds to the RESCAL model, which can be interpreted as follows: The matrix $A$ holds the latent variable representations of the entities, while each frontal slice $R_k$ of $R$ is an asymmetric $r \times r$ matrix that models the interactions of the latent components for the $k$-th relation. The variable $r$ denotes the number of latent components of the factorization. An important aspect of RESCAL for relational learning is that entities have a unique latent representation via the matrix $A$. This enables a relational learning effect via the propagation of information over different relations and the occurrences of entities as subjects or objects in relationships. For a detailed description of RESCAL we refer the reader to Nickel et al. [21, 22]. After computing the factorization (2), the score for the existence of a single relationship is calculated in ARE via $\hat{x}_{ijk} = a_i^\top R_k a_j + \sum_{p=1}^{P} w_{kp} m_{ijp}$.

The construction of the tensor $M$ is as follows: Let $F = \{f_p\}_{p=1}^{P}$ be a set of given real-valued functions $f_p : V \times V \to \mathbb{R}$ which assign scores to each pair of entities in $V$. Examples of such score functions include link prediction heuristics such as Common Neighbors, Katz Centrality, or Horn clauses. Depending on the underlying model, these scores can be interpreted as confidence values or as probabilities that a relationship exists between two entities. We collect the real-valued predictions of the $P$ score functions in the tensor $M \in \mathbb{R}^{N \times N \times P}$ by setting $m_{ijp} = f_p(v_i, v_j)$. Supplementary material A.2 provides a detailed description of the construction of $M$ for typical score functions. The tensor $M$ acts in the factorization as an independent source of information that predicts the existence of relationships. The term $M \times_3 W$ can be interpreted as learning a set of weights $w_{kp}$ which indicate how much the $p$-th score function in $M$ correlates with the $k$-th relation in $X$. For this reason we also refer to $M$ as the oracle tensor. 
If $M$ is composed of relation path features as proposed by Lao et al. [15], the term $M W$ is closely related to the Path Ranking Algorithm (PRA) [15].

The main idea of equation (2) is the following: The term $R \times_1 A \times_2 A$ is equivalent to the RESCAL model and provides an efficient approach to learn from latent patterns in relational data. The oracle tensor $M$, on the other hand, is not factorized, such that it can hold information that is difficult to predict via latent variable methods. As it is not clear a priori which score functions are good predictors for which relations, the term $M \times_3 W$ learns a weighting of how predictive each score function is for each relation. By integrating both terms in an additive model, the term $M \times_3 W$ can potentially reduce the required rank for the RESCAL term by explaining links that, for instance, reduce the diclique partition number of a digraph. Rules and operations that are likely to reduce the diclique partition number of slices in $X$ are therefore good candidates to be included in $M$. For instance, by including a copy of the observed adjacency tensor $X$ in $M$ (or some selected frontal slices $X_k$), the term $M \times_3 W$ can easily model common multi-relational patterns, where the existence of a relationship in one relation correlates with the existence of a relationship between the same entities in another relation, via $x_{ijk} = \sum_{p \neq k} w_{kp} x_{ijp}$. Since $w_{kp}$ is allowed to be negative, anti-correlations can be modeled efficiently. ARE is similar in spirit to the model of Koren [14], which extends SVD with additive terms to include local neighborhood information in a uni-relational recommendation setting, and to Jiang et al. [9], which uses an additive matrix factorization model for link prediction. 
Furthermore, the recently proposed Google Knowledge Vault (KV) [5] considers a combination of PRA and a neural network model related to RESCAL for learning from large multi-relational datasets. However, in KV both models are trained separately and combined only later in a separate fusion step, whereas ARE learns both models jointly, which leads to the desired rank-reduction effect.

To compute ARE, we pursue a similar optimization scheme as used for RESCAL, which has been shown to scale to large datasets [22]. In particular, we solve the regularized optimization problem

$$\min_{A,R,W} \; \|X - (R \times_1 A \times_2 A + M \times_3 W)\|_F^2 + \lambda_A \|A\|_F^2 + \lambda_R \|R\|_F^2 + \lambda_W \|W\|_F^2 \qquad (3)$$

via alternating least-squares, which is a block-coordinate optimization method in which blocks of variables are updated alternatingly until convergence. For equation (3) the variable blocks are given naturally by the factors $A$, $R$, and $W$.

Updates for W Let $E = X - R \times_1 A \times_2 A$ and let $I$ be the identity matrix. We rewrite equation (2) as $E_{(3)} \approx W M_{(3)}$, such that equation (3) becomes a regularized least-squares problem when solving for $W$. It follows that updates for $W$ can be computed via $W^\top \leftarrow (M_{(3)} M_{(3)}^\top + \lambda_W I)^{-1} M_{(3)} E_{(3)}^\top$. However, performing the updates in this way would be very inefficient, as it involves the computation of the dense $N \times N \times K$ tensor $R \times_1 A \times_2 A$. This would quickly lead to scalability issues with regard to runtime and memory requirements. To overcome this issue, we rewrite $M_{(3)} E_{(3)}^\top$ using the equality $(R \times_1 A \times_2 A)_{(3)} M_{(3)}^\top = R_{(3)} (M \times_1 A^\top \times_2 A^\top)_{(3)}^\top$. Updates for $W$ can then be computed efficiently as

$$W^\top \leftarrow (M_{(3)} M_{(3)}^\top + \lambda_W I)^{-1} \left[ X_{(3)} M_{(3)}^\top - R_{(3)} (M \times_1 A^\top \times_2 A^\top)_{(3)}^\top \right]^\top. \qquad (4)$$

In equation (4) the dense tensor $R \times_1 A \times_2 A$ is never computed explicitly, and the computational complexity with regard to the parameters $N$, $K$, and $r$ is reduced from $O(N^2 K r)$ to $O(N K r^3)$. Furthermore, all terms in equation (4) except $R_{(3)} (M \times_1 A^\top \times_2 A^\top)_{(3)}^\top$ are constant and only have to be computed once at the beginning of the algorithm. Finally, $X_{(3)} M_{(3)}^\top$ and $M_{(3)} M_{(3)}^\top$ are products of sparse matrices, such that their computational complexity depends only on the number of nonzeros in $X$ and $M$. A full derivation of equation (4) can be found in the supplementary material A.4.

Updates for A and R The updates for $A$ and $R$ can be derived directly from the RESCAL-ALS algorithm by setting $E = X - M \times_3 W$ and computing the RESCAL factorization of $E$. The updates for $A$ can therefore be computed by

$$A \leftarrow \left( \sum_{k=1}^{K} E_k A R_k^\top + E_k^\top A R_k \right) \left( \sum_{k=1}^{K} R_k A^\top A R_k^\top + R_k^\top A^\top A R_k + \lambda I \right)^{-1},$$

where $E_k = X_k - M \times_3 w_k$ and $w_k$ denotes the $k$-th row of $W$. The updates of $R$ can be computed in the following way: Let $A = U \Sigma V^\top$ be the SVD of $A$, where $\sigma_i$ is the $i$-th singular value of $A$. Furthermore, let $S$ be a matrix with entries $s_{ij} = \sigma_i \sigma_j / (\sigma_i^2 \sigma_j^2 + \lambda_R)$. An update of $R_k$ can then be computed via $R_k \leftarrow V \left( S \ast (U^\top (X_k - M \times_3 w_k) U) \right) V^\top$, where "$\ast$" denotes the Hadamard product. For a full derivation of these updates please see [20].

4 Evaluation

We evaluated ARE on various multi-relational datasets where we were particularly interested in its generalization ability relative to the factorization rank. 
For comparison, we included the well-known CP and Tucker tensor factorizations in the evaluation, as well as RESCAL and the non-latent model $X \approx M \times_3 W$ (in the following denoted by MW). In all experiments, the oracle tensor $M$ used in MW and ARE is identical, such that the results of MW can be regarded as a baseline for the contribution of the heuristic methods to ARE. Following [10, 11, 28, 21], we used k-fold cross-validation for the evaluation, partitioning the entries of the adjacency tensor into training, validation, and test sets. In the test and validation folds all entries are set to 0. Due to the large imbalance of true and false relationships, we used the area under the precision-recall curve (AUC-PR) to measure predictive performance, which is known to behave better with imbalanced classes than AUC-ROC. All AUC-PR results are averaged over the different test folds. Links and references for the datasets used in the evaluation are provided in the supplementary material A.5.

Figure 1: Evaluation results for AUC-PR on the Kinships (1a) and Social Evolution data sets (1b-1f): (a) Kinships, (b) PoliticalDiscussant, (c) CloseFriend, (d) BlogLiveJournalTwitter, (e) SocializeTwicePerWeek, (f) FacebookAllTaggedPhotos.

Social Evolution First, we evaluated ARE on a dataset consisting of multiple relations between persons living in an undergraduate dormitory. From the relational data, we constructed an $84 \times 84 \times 5$ adjacency tensor where two modes correspond to persons and the third mode represents the relations between these persons, such as friendship (CloseFriend), social media interaction (BlogLivejournalTwitter and FacebookAllTaggedPhotos), political discussion (PoliticalDiscussant), and social interaction (SocializeTwicePerWeek). For each relation, we performed link prediction via 5-fold cross-validation. The oracle tensor $M$ consisted only of a copy of the observed tensor $X$. 
Including $X$ in $M$ allows ARE to efficiently exploit patterns where the existence of a social relationship for a particular pair of persons is predictive of other social interactions between exactly this pair of persons (e.g. close friends are more likely to socialize twice per week). It can be seen from the results in figures 1b-1f that ARE achieves better performance than all competing approaches and already achieves excellent performance at a very low rank, which supports our theoretical considerations.

Kinship The Kinship dataset describes the kinship relations in the Australian Alyawarra tribe in terms of 26 kinship relations between 104 persons. The task in the experiment was to predict unknown kinship relations via 10-fold cross-validation in the same manner as in [21]. Table 1 shows the improvement of ARE over state-of-the-art relational learning methods. Figure 1a shows the predictive performance relative to the rank for multiple factorization methods. It can be seen that ARE outperforms all other methods significantly at lower ranks. Moreover, starting from rank 40, ARE already gives results comparable to the best results in table 1. As in the previous experiments, $M$ consisted only of a copy of $X$. On this dataset, the copy of $X$ allows ARE to efficiently model that the relations in the data are mutually exclusive by setting $w_{ii} > 0$ and $w_{ij} < 0$ for all $i \neq j$. 
This also explains the large improvement of ARE over RESCAL for small ranks.

Link Prediction on Semantic Web Data. The SWRC ontology models a research group in terms of people, publications, projects, and research interests. The task in our experiments was to predict the affiliation relation, i.e., to map persons to research groups. We followed the experimental setting of [18]: from the raw data, we created a 12058 × 12058 × 85 tensor by considering all directly connected entities of persons and research groups. In total, 168 persons and 5 research groups are considered in the evaluation data. The oracle tensor M again consisted of a copy of X and of the common neighbor heuristics X_i X_i^⊤ and X_i^⊤ X_i. These heuristics were included to model patterns like "people who share the same research interest are likely in the same affiliation" or "a person is related to a department if the person belongs to a group in the department". We also imposed a sparsity penalty on W to prune away inactive heuristics during the iterations.
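Such common neighbor heuristics can be read off directly from the frontal slices of the adjacency tensor. The following sketch (synthetic data, illustrative only) stacks a copy of each relation together with its two common-neighbor products into an oracle tensor:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 2
X = (rng.random((n, n, m)) < 0.3).astype(float)   # synthetic adjacency tensor

slices = []
for i in range(m):
    Xi = X[:, :, i]
    slices.append(Xi)            # copy of the observed relation itself
    slices.append(Xi @ Xi.T)     # (X_i X_i^T)[a, b]: common successors of a, b
    slices.append(Xi.T @ Xi)     # (X_i^T X_i)[a, b]: common predecessors of a, b

# Oracle tensor with 3 * m frontal slices.
M = np.stack(slices, axis=2)
```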
Table 2 shows that ARE improved the results significantly over three state-of-the-art link prediction methods for Semantic Web data. Moreover, whereas RESCAL required a rank of 45, ARE required only a small rank of 15.

Figure 2: Runtime on Cora

Table 1: Evaluation results on Kinships.

        MRC [11]   BCTF [28]   LFM [8]        RESCAL   ARE
AUC     86         90          94.6           96       96.9
Rank    -          -           (50,50,500)    100      90

Table 2: Evaluation results on SWRC.

        SVD    Subtrees [18]   RESCAL   MW     ARE
nDCG    0.8    0.95            0.96     0.59   0.99

Runtime Performance. To evaluate the trade-off between runtime and predictive performance, we recorded the nDCG values of RESCAL and ARE after each iteration of the respective ALS algorithms on the Cora citation database. We used the variant of Cora in which all publications are organized in a hierarchy of topics with two to three levels and 68 leaves. The relational data consists of information about paper citations, authors, and topics, from which a tensor of size 28073 × 28073 × 3 is constructed. The oracle tensor consisted of a copy of X and the common neighbor patterns X_i X_j^⊤ and X_i^⊤ X_j, to model patterns such as "a cited paper shares the same topic", "a cited paper shares the same author", etc. The task of the experiment was to predict the leaf topic of papers by 5-fold cross-validation, on a moderate PC with an Intel Core i5 @ 3.1 GHz and 4 GB RAM. The optimal rank of 220 for RESCAL was determined from the range [10, 300] via parameter selection. For ARE we used a significantly smaller rank of 20. Figure 2 shows the runtime of RESCAL and ARE compared to their predictive performance. It is evident that ARE outperforms RESCAL after a few iterations, although the rank of the factorization is decreased by an order of magnitude.
Moreover, ARE surpasses the best prediction results of RESCAL in terms of total runtime even before the first iteration of RESCAL-ALS has terminated.

5 Concluding Remarks

In this paper we considered learning from latent and observable patterns in multi-relational data. We showed analytically that the rank of adjacency tensors is upper bounded by the sum of the diclique partition numbers and lower bounded by the maximum number of strongly connected components of any relation in the data. Based on our theoretical results, we proposed an additive tensor factorization approach for learning from multi-relational data which combines strengths of latent and observable variable methods. Furthermore, we presented an efficient and scalable algorithm to compute the factorization. Experimentally, we showed that the proposed approach not only increases the predictive performance but is also very successful in reducing the required rank, and therefore also the required runtime, of the factorization. The proposed additive model is one option to overcome the rank-scalability problem outlined in Section 2; however, it is not the only one. In future work we intend to investigate to what extent sparse or hierarchical models can be used to the same effect.

Acknowledgements. Maximilian Nickel acknowledges support by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We thank Youssef Mroueh and Lorenzo Rosasco for clarifying discussions on the theoretical part of this paper.

References

[1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. “Mixed Membership Stochastic Blockmodels”. In: Journal of Machine Learning Research 9 (2008), pp. 1981–2014.

[2] A. Bordes, J. Weston, R. Collobert, and Y. Bengio. “Learning Structured Embeddings of Knowledge Bases”.
In: Proceedings of the 25th Conference on Artificial Intelligence. 2011.

[3] R. A. Brualdi and H. J. Ryser. Combinatorial Matrix Theory. 1991.

[4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. Mitchell. “Toward an Architecture for Never-Ending Language Learning”. In: AAAI. 2010, pp. 1306–1313.

[5] X. L. Dong, K. Murphy, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, T. Strohmann, S. Sun, and W. Zhang. “Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion”. In: Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2014.

[6] L. Getoor, N. Friedman, D. Koller, A. Pfeffer, and B. Taskar. “Probabilistic Relational Models”. In: Introduction to Statistical Relational Learning. 2007, pp. 129–174.

[7] P. D. Hoff. “Modeling homophily and stochastic equivalence in symmetric relational data”. In: Advances in Neural Information Processing Systems. Vol. 20. 2008, pp. 657–664.

[8] R. Jenatton, N. Le Roux, A. Bordes, and G. Obozinski. “A latent factor model for highly multi-relational data”. In: Advances in Neural Information Processing Systems. Vol. 25. 2012, pp. 3176–3184.

[9] X. Jiang, V. Tresp, Y. Huang, and M. Nickel. “Link Prediction in Multi-relational Graphs using Additive Models”. In: Proceedings of the International Workshop on Semantic Technologies meet Recommender Systems & Big Data at the ISWC. Vol. 919. 2012, pp. 1–12.

[10] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. “Learning systems of concepts with an infinite relational model”. In: AAAI. Vol. 3. 2006, p. 5.

[11] S. Kok and P. Domingos. “Statistical Predicate Invention”. In: Proceedings of the 24th International Conference on Machine Learning. 2007, pp. 433–440.

[12] T. G. Kolda and B. W. Bader. “Tensor Decompositions and Applications”.
In: SIAM Review 51.3 (2009), pp. 455–500.

[13] T. G. Kolda, B. W. Bader, and J. P. Kenny. “Higher-order web link analysis using multilinear algebra”. In: Proceedings of the Fifth International Conference on Data Mining. 2005, pp. 242–249.

[14] Y. Koren. “Factorization meets the neighborhood: a multifaceted collaborative filtering model”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, pp. 426–434.

[15] N. Lao and W. W. Cohen. “Relational retrieval using a combination of path-constrained random walks”. In: Machine Learning 81.1 (2010), pp. 53–67.

[16] D. Liben-Nowell and J. Kleinberg. “The link-prediction problem for social networks”. In: Journal of the American Society for Information Science and Technology 58.7 (2007), pp. 1019–1031.

[17] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. “Complexity measures of sign matrices”. In: Combinatorica 27.4 (2007), pp. 439–463.

[18] U. Lösch, S. Bloehdorn, and A. Rettinger. “Graph Kernels for RDF Data”. In: The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, ESWC 2012. Vol. 7295. 2012, pp. 134–148.

[19] S. D. Monson, N. J. Pullman, and R. Rees. “A survey of clique and biclique coverings and factorizations of (0,1)-matrices”. In: Bulletin of the ICA 14 (1995), pp. 17–86.

[20] M. Nickel. “Tensor factorization for relational learning”. PhD thesis. LMU München, 2013.

[21] M. Nickel, V. Tresp, and H.-P. Kriegel. “A Three-Way Model for Collective Learning on Multi-Relational Data”. In: Proceedings of the 28th International Conference on Machine Learning. 2011, pp. 809–816.

[22] M. Nickel, V. Tresp, and H.-P. Kriegel.
\u201cFactorizing YAGO: scalable machine learning for linked data\u201d.\n\nIn: Proceedings of the 21st international conference on World Wide Web. 2012, pp. 271\u2013280.\n[23]\nJ. R. Quinlan. \u201cLearning logical de\ufb01nitions from relations\u201d. In: Machine Learning 5 (1990), pp. 239\u2013266.\n[24] M. Richardson and P. Domingos. \u201cMarkov logic networks\u201d. In: Machine Learning 62.1 (2006), pp. 107\u2013\n\n136.\n\n[27]\n\n[25] D. Serre. Matrices: Theory and applications. Vol. 216. 2010.\n[26] A. P. Singh and G. J. Gordon. \u201cRelational learning via collective matrix factorization\u201d. In: Proc. of the\n14th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining. 2008, pp. 650\u2013658.\nF. M. Suchanek, G. Kasneci, and G. Weikum. \u201cYago: A Core of Semantic Knowledge\u201d. In: Proceedings\nof the 16th international conference on World Wide Web. 2007, pp. 697\u2013706.\nI. Sutskever, R. Salakhutdinov, and J. Tenenbaum. \u201cModelling Relational Data using Bayesian Clustered\nTensor Factorization\u201d. In: Advances in Neural Information Processing Systems 22. 2009, pp. 1821\u20131828.\n[29] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. \u201cIn\ufb01nite Hidden Relational Models\u201d. In: Proc. of the Twenty-\n\n[28]\n\nSecond Conference Annual Conference on Uncertainty in Arti\ufb01cial Intelligence. 2006, pp. 544\u2013551.\n\n9\n\n\f", "award": [], "sourceid": 684, "authors": [{"given_name": "Maximilian", "family_name": "Nickel", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Xueyan", "family_name": "Jiang", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"given_name": "Volker", "family_name": "Tresp", "institution": "Siemens AG"}]}