{"title": "Using matrices to model symbolic relationship", "book": "Advances in Neural Information Processing Systems", "page_first": 1593, "page_last": 1600, "abstract": "We describe a way of learning matrix representations of objects and relationships. The goal of learning is to allow multiplication of matrices to represent symbolic relationships between objects and symbolic relationships between relationships, which is the main novelty of the method. We demonstrate that this leads to excellent generalization in two different domains: modular arithmetic and family relationships. We show that the same system can learn first-order propositions such as $(2, 5) \\member +\\!3$ or $(Christopher, Penelope)\\member has\\_wife$, and higher-order propositions such as $(3, +\\!3) \\member plus$ and $(+\\!3, -\\!3) \\member inverse$ or $(has\\_husband, has\\_wife)\\in higher\\_oppsex$. We further demonstrate that the system understands how higher-order propositions are related to first-order ones by showing that it can correctly answer questions about first-order propositions involving the relations $+\\!3$ or $has\\_wife$ even though it has not been trained on any first-order examples involving these relations.", "full_text": "Using matrices to model symbolic relationships\n\nIlya Sutskever and Geoffrey Hinton\n\nUniversity of Toronto\n\n{ilya, hinton}@cs.utoronto.ca\n\nAbstract\n\nWe describe a way of learning matrix representations of objects and relationships.\nThe goal of learning is to allow multiplication of matrices to represent symbolic\nrelationships between objects and symbolic relationships between relationships,\nwhich is the main novelty of the method. We demonstrate that this leads to ex-\ncellent generalization in two different domains: modular arithmetic and family\nrelationships. 
We show that the same system can learn \ufb01rst-order propositions\nsuch as (2, 5) \u2208 +3 or (Christopher, Penelope) \u2208 has wife, and higher-order\npropositions such as (3, +3) \u2208 plus and (+3, \u22123) \u2208 inverse or (has husband,\nhas wife) \u2208 higher oppsex. We further demonstrate that the system understands\nhow higher-order propositions are related to \ufb01rst-order ones by showing that it can\ncorrectly answer questions about \ufb01rst-order propositions involving the relations\n+3 or has wife even though it has not been trained on any \ufb01rst-order examples\ninvolving these relations.\n\n1 Introduction\n\nIt is sometimes possible to \ufb01nd a way of mapping objects in a \u201cdata\u201d domain into objects in a \u201ctarget\u201d\ndomain so that operations in the data domain can be modelled by operations in the target domain.\nIf, for example, we map each positive number to its logarithm, multiplication in the data domain can\nbe modelled by addition in the target domain. When the objects in the data and target domains are\nmore complicated than single numbers, it may be dif\ufb01cult to \ufb01nd good mappings using inspiration\nalone. If we consider a continuous space of possible mappings and if we de\ufb01ne a smooth measure of\nhow well any particular mapping works, it is possible to use gradient search to \ufb01nd good mappings\nbetween the data and target domains.\n\nPaccanaro and Hinton [10] introduced a method called \u201cLinear Relational Embedding\u201d (LRE) that\nuses multiplication of vectors by matrices in the target domain to model pairwise relations between\nobjects in the data domain. LRE applies to a \ufb01nite set of objects \u2126 and a \ufb01nite set of relations\nR where every relation R \u2208 R is a set of pairs of objects, so R \u2286 \u2126 \u00d7 \u2126. 
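To make this set-up concrete, the objects and relations of the base-12 modular arithmetic task used later in the paper can be written down directly as sets of pairs; a minimal sketch (the helper names are ours, not the paper's):

```python
# Objects and relations for the base-12 modular arithmetic task (Section 2):
# a relation R is literally a set of ordered pairs, so R is a subset of Omega x Omega.
OBJECTS = set(range(12))  # Omega = {0, ..., 11}

def plus_k(k):
    # the relation +k: (a, b) is a member iff b = a + k (mod 12)
    return {(a, (a + k) % 12) for a in OBJECTS}

def times_k(k):
    # the relation xk: (a, b) is a member iff b = a * k (mod 12)
    return {(a, (a * k) % 12) for a in OBJECTS}

RELATIONS = {f"+{k}": plus_k(k) for k in range(12)}
RELATIONS.update({f"x{k}": times_k(k) for k in range(12)})

# e.g. the first-order proposition (2, 5) in +3 from the abstract:
assert (2, 5) in RELATIONS["+3"]
# 24 relations of 12 pairs each give the 288 propositions counted in Section 2
assert sum(len(r) for r in RELATIONS.values()) == 288
```

Note that each of these relations is a function of its first argument, so every query (A, ?) ∈ R here has exactly one correct answer; in general a relation may pair an object with several answers, or with none.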
Given the objects\nand relations, LRE \ufb01nds a column-vector representation A of each object A \u2208 \u2126, and a matrix\nrepresentation R of each relation R \u2208 R, such that the product RA is close to B for all pairs\n(A, B) that are members of the relation R, and far from C for all pairs (A, C) that are not members\nof R. LRE learns the vectors and matrices by performing gradient descent in a cost function C that\nmeasures the similarity between RA and every B such that (A, B) \u2208 R relative to the similarities\nbetween RA and the vector representations of all the objects in the set of known objects \u2126:\n\nC = -\sum_{R \in \mathcal{R}} \sum_{(A,B) \in R} \log \frac{\exp(-\|RA - B\|^2)}{\sum_{C \in \Omega} \exp(-\|RA - C\|^2)} \qquad (1)\n\nThe cost function in Eq. 1 is \u201cdiscriminative\u201d because it compares the distance from RA to each\ncorrect answer with the distances from RA to all possible answers. This prevents trivial solutions\nin which RA and B are always zero, but it also causes the cost function to be nonconvex, making\nit hard to optimize. We can view \exp(-\|RA - B\|^2) as the unnormalized probability density of\nB under a spherical Gaussian centered at RA. The cost function then represents the sum of the\nnegative log probabilities of picking the correct answers to questions of the form (A, ?) \u2208 R if we\npick answers stochastically in proportion to their probability densities under the spherical Gaussian\ncentered at RA.\n\nWe say that LRE accurately models a set of objects and relations if its answers to queries of the\nform (A, ?) \u2208 R are correct, which means that for each object A and relation R such that there are\nk objects X satisfying (A, X) \u2208 R, the vector representation X of each such object X must be\namong the k closest vector representations to RA. The de\ufb01nition of correctness implies that LRE\u2019s\nanswer to a query (A, ?) \u2208 R that has no solutions is always trivially correct. 
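The cost in Eq. 1 is easy to state in code. Below is a minimal NumPy sketch (the function and variable names are ours): for each known fact (A, R, B) it accumulates the negative log probability of the correct answer under a softmax over negative squared distances to every candidate object.

```python
import numpy as np

def lre_cost(objects, relations, facts):
    """Discriminative LRE cost of Eq. 1: for each known fact (A, R, B),
    the negative log probability of B under a softmax over -||RA - C||^2
    for all candidate objects C. `objects` maps names to d-vectors,
    `relations` maps names to d x d matrices, and `facts` is a list of
    (A, R, B) name triples. (A sketch; names and shapes are our own.)"""
    names = list(objects)
    all_vecs = np.stack([objects[n] for n in names])       # (|Omega|, d)
    cost = 0.0
    for a, r, b in facts:
        pred = relations[r] @ objects[a]                   # RA
        logits = -np.sum((all_vecs - pred) ** 2, axis=1)   # -||RA - C||^2
        log_z = np.logaddexp.reduce(logits)                # log sum_C exp(...)
        cost -= logits[names.index(b)] - log_z             # -log softmax prob of B
    return cost
```

Because the normalizer ranges over all of Ω, the cost stays strictly positive even when RA lands exactly on B, which is what rules out the trivial all-zero solution.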
More re\ufb01ned versions\nof LRE handle such unsatis\ufb01able queries more explicitly [9].\n\nIt may not be obvious how to determine if the representation found by LRE is good. One way is\nto check if LRE\u2019s representation generalizes to test data. More speci\ufb01cally, if LRE has not been\ninformed that B is an answer to the query (A, ?) \u2208 R that has k correct answers (that is, (A, B) was\nremoved from R during LRE\u2019s learning), yet LRE answers the query (A, ?) \u2208 R correctly by placing\nB among the k closest object representations to RA, then we can claim that LRE\u2019s representation\ngeneralizes. Such generalization can occur only if LRE learned the \u201cright\u201d representations A, B,\nand R from the other propositions, which can happen only if the true relation is plausible according\nto LRE\u2019s inductive bias that determines the subjective plausibility of every possible set of objects\nand relations (see, e.g., [6]). If the representation is high-dimensional, then LRE can easily represent\nany set of relations that is not too large, so its inductive bias \ufb01nds all sets of relations plausible, which\nprevents generalization from being good. However, if the representation is low-dimensional, then\nLRE must make use of regularities in the training set in order to accurately model the data, but if\nit succeeds in doing so, generalization will be good. Paccanaro and Hinton [10] show that low-dimensional LRE exhibits excellent generalization on datasets such as the family relations task. In\ngeneral, the dimensionality of the representation should grow with the total number of objects and\nrelations: when there are few objects and relations, a high-dimensional representation easily\nover\ufb01ts, but if the number of objects and relations is large then the dimensionality can be higher\nwithout over\ufb01tting. 
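The correctness criterion used above amounts to a k-nearest-neighbour check around RA; a small sketch, with our own helper name:

```python
import numpy as np

def k_closest(pred, candidates, k):
    """Names of the k candidate objects whose representations are closest
    (in squared distance) to pred (= RA). `candidates` maps names to
    equally shaped arrays; works for LRE vectors or MRE matrices alike."""
    names = list(candidates)
    d2 = [np.sum((candidates[n] - pred) ** 2) for n in names]
    order = np.argsort(d2)[:k]
    return {names[i] for i in order}
```

A held-out pair (A, B) of a relation with k correct answers then counts as a generalization success exactly when B appears in `k_closest(R @ A, objects, k)`.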
The best dimensionality depends on the \u201c\ufb01t\u201d between LRE and the data, and is\nmainly an empirical question.\n\nA drawback of LRE is that the square matrices it uses to represent relations are quadratically more\ncumbersome than the vectors it uses to represent objects. This causes the number of free parameters\nto grow rapidly when the dimensionality of the representations is increased. More importantly, it\nalso means that relations cannot themselves be treated as objects. Paccanaro and Hinton [10], for\nexample, describe a system that learns propositions of the form: (2, 5) \u2208 +3 where +3 is a relation\nthat is represented by a learned matrix, but their system does not understand that the learned matrix\nfor +3 has anything in common with the learned vector that is used to model the number 3 in\npropositions like (5, 3) \u2208 \u22122.\nIn this paper we describe \u201cMatrix Relational Embedding\u201d (MRE), which is a version of LRE that\nuses matrices as the representation for objects as well as for relations.1 MRE optimizes the same\ncost function as LRE (Eq. 1), with the difference that RA \u2212 C is now a matrix rather than a\nvector and \|RA - C\|^2 denotes the sum of the squares of the entries of the matrix. This choice\nof matrix norm makes MRE a direct generalization of LRE. 
All distances between matrices will be\ncomputed using this norm.\n\nAlthough MRE is a simple variation of LRE, it has two important advantages.\n\nThe \ufb01rst advantage of MRE is that when using an N \u00d7 N matrix to represent each object it is\npossible to make N much smaller than when using an N-dimensional vector, so MRE can use about\nthe same number of parameters as LRE for each object but many fewer parameters than LRE for\neach relation, which is useful for \u201csimple\u201d relations.\n\n1We have also experimented with a version of LRE that learns to generate a learned matrix representation of\na relation from a learned vector representation of the relation. This too makes it possible to treat relations as objects because they both have vector representations. However, it is less straightforward than simply representing\nobjects by matrices and it does not generalize quite as well.\n\n\fThe second advantage of MRE, which is also the main novelty of this paper, is that MRE is\ncapable of representing higher-order relations, instances of which are (+3, \u22123) \u2208 inverse or\n(has husband, has wife) \u2208 higher oppsex. It can also represent relations involving an object\nand a relation, for instance (3, +3) \u2208 plus. Formally, we are given a \ufb01nite set of higher-order relations \u02dcR, where a higher-order relation \u02dcR \u2208 \u02dcR is a relation whose arguments can be relations as well\nas objects, which we formalize as \u02dcR \u2286 R \u00d7 R or \u02dcR \u2286 \u2126 \u00d7 R (R is the set of the basic relations).\nThe matrix representation of MRE allows it to treat relations in (almost) the same way it treats basic\nobjects, so there is no dif\ufb01culty representing relations whose arguments are also relations.\n\nWe show that MRE can answer questions of the form (4,?) \u2208 +3 even though the training set\ncontains no examples of the basic relation +3. 
It can do this because it is told what +3 means by\nbeing given higher-order information about +3. It is told that (3, +3) \u2208 plus and it \ufb01gures out what\nplus means from higher-order examples of the form (2, +2) \u2208 plus and basic examples of the form\n(3, 5) \u2208 +2. This enables MRE to understand a relation from an \u201canalogical de\ufb01nition\u201d: if it is\ntold that has father is to has mother as has brother is to has sister, etc., then MRE can answer\nqueries involving has father based on this analogical information alone. Finally, we show that\nMRE can learn new relations after an initial set of objects and relations has already been learned and\nthe learned matrices have been \ufb01xed. This shows that MRE can add new knowledge to previously\nacquired propositions without the need to relearn the original propositions. We believe that MRE\nis the \ufb01rst gradient-descent learning system that can learn new relations from de\ufb01nitions, including\nlearning the meanings of the terms used in the de\ufb01nitions. This signi\ufb01cantly extends the symbolic\nlearning abilities of connectionist-type learning algorithms.\n\nSome of the existing connectionist models for representing and learning relations and analogies\n[2, 4] are able to detect new relations and to represent hierarchical relations of high complexity.\nThey differ in that they use temporal synchrony to explicitly represent the binding of relations to\nobjects and, more importantly, in that they do not use distributed representations for the relations\nthemselves.\n\n2 The modular arithmetic task\n\nPaccanaro and Hinton [10] describe a very simple modular arithmetic task in which the 10 objects\nare the numbers from 0 to 9 and the 9 relations are +0 to +4 and \u22121 to \u22124. Linear Relational\nEmbedding easily learns this task using two-dimensional vectors for the numbers and 2 \u00d7 2 matrices\nfor the relations. 
It arranges the numbers in a circle centered at the origin and uses rotation matrices\nto implement the relations. We used base 12 modular arithmetic, thus there are 12 objects, and made\nthe task much more dif\ufb01cult by using both the twelve relations +0 to +11 and the twelve relations \u00d70\nto \u00d711. We did not include subtraction and division because in modular arithmetic every proposition\ninvolving subtraction or division is equivalent to one involving addition or multiplication.\n\nThere are 288 propositions in the modular arithmetic task. We tried matrices of various sizes and\ndiscovered that 4 \u00d7 4 matrices gave the best generalization when some of the cases are held out. We\nheld out 30, 60, or 90 test cases chosen at random and used the remaining cases to learn the real-valued entries of the 12 matrices that represent numbers and the 24 matrices that represent relations.\nThe learning was performed by gradient descent in the cost function in Eq. 1. We repeated this \ufb01ve\ntimes with a different random selection of held-out cases each time. Table 1 shows the number of\nerrors on the held-out test cases.\n\n3 Details of the learning procedure\n\nTo learn the parameters, we used the conjugate gradient optimization algorithm available in the\n\u201cscipy\u201d library of the Python programming language with the default optimization parameters. We\ncomputed the gradient of the cost function on all of the training cases before updating the parameters,\nand initialized the parameters by a random sample from a spherical Gaussian with unit variance\non each dimension. We also included \u201cweight-decay\u201d by adding 0.01 \sum_i w_i^2 to the cost function,\nwhere i indexes all of the entries in the matrices for objects and relations. The variance of the\nresults is due to the nonconvexity of the objective function. 
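In outline, the procedure is: flatten all matrix entries into one parameter vector, add the weight-decay penalty 0.01 Σ_i w_i^2, and minimize with a full-batch gradient method. The sketch below substitutes plain gradient descent with a central-difference gradient for scipy's conjugate-gradient routine, just to stay self-contained; the learning rate and step count are our own choices, not the paper's:

```python
import numpy as np

def fit(cost_fn, theta0, lr=0.2, steps=200, decay=0.01, eps=1e-5):
    """Minimize cost_fn(theta) + decay * sum(theta**2) by full-batch
    gradient descent with a numerical gradient. A stand-in sketch for the
    scipy conjugate-gradient optimizer described in the text; lr, steps,
    and eps are illustrative choices of our own."""
    theta = np.asarray(theta0, dtype=float).copy()

    def total(t):
        # cost plus the 0.01 * sum_i w_i^2 weight-decay term
        return cost_fn(t) + decay * np.sum(t ** 2)

    for _ in range(steps):
        # central-difference estimate of the full-batch gradient
        grad = np.array([(total(theta + eps * e) - total(theta - eps * e)) / (2 * eps)
                         for e in np.eye(theta.size)])
        theta -= lr * grad
    return theta
```

Because the objective is nonconvex, different random initializations reach different minima, which is the source of the run-to-run variance reported in the tables.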
The implementation is available in\n[www.cs.utoronto.ca/\u223cilya/code/2008/mre.tar.gz].\n\n\fTest results for the basic modular arithmetic.\n\n        errors on 5 test sets      mean test error\n(30)    0    0    0    0    0           0.0\n(60)   29    4    0    1    0           6.8\n(90)   27   23   16   31   23          24.0\n\nTable 1: Test results on the basic modular arithmetic task. Each entry shows the number of errors\non the randomly held-out cases. There were no errors on the training set. Each test query has 12\npossible answers of which 1 is correct, so random guessing should be incorrect on at least 90% of\nthe test cases. The number of held-out cases of each run is written in brackets.\n\n[Figure 1 appears here: (a) the two isomorphic family trees, an English tree (Christopher = Penelope, Andrew = Christine, Margaret = Arthur, Victoria = James, Jennifer = Charles, Colin, Charlotte) and an Italian tree (Aurelio = Maria, Bortolo = Emma, Grazia = Pierino, Giannina = Pietro, Doralice = Marcello, Alberto, Mariemma); (b) a diagram of the points RA, B, C, and D.]\n\nFigure 1: (a) Two isomorphic family trees (b) An example of a situation in which the discriminative\ncost function in Eq. 1 causes the matrix RA produced by MRE to be far from the correct answer,\nB (see section 5).\n\nIn an attempt to improve generalization, we tried constraining all of the 4 \u00d7 4 matrices by setting\nhalf of the elements of each matrix to zero so that they were each equivalent to two independent\n2 \u00d7 2 matrices. Separate experiments showed that 2 \u00d7 2 matrices were suf\ufb01cient for learning either\nthe mod 3 or the mod 4 version of our modular arithmetic task, so the mod 12 version can clearly be\ndone using a pair of 2 \u00d7 2 matrices for each number or relation. 
However, the gradient optimization\ngets stuck in poor local minima.\n\n4 The standard family trees task\n\nThe \u201cstandard\u201d family trees task de\ufb01ned in [3] consists of the two family trees shown in \ufb01gure\n1(a) where the relations are {has husband, has wife, has son, has daughter, has father, has mother,\nhas brother, has sister, has nephew, has niece, has uncle, has aunt}. Notice that for the last four\nrelations there are people in the families in \ufb01gure 1(a) for whom there are two different correct\nanswers to the question (A,?) \u2208 R. When there are N correct answers, the best way to maximize\nthe sum of the log probabilities of picking the correct answer on each of the N cases is to produce\nan output matrix that is equidistant from the N correct answers and far from all other answers. If\nthe designated correct answer on such a case is not among the N closest, we treat that case as an\nerror. If we count cases with two correct answers as two different cases, the family trees task has 112\ncases.\n\nWe used precisely the same learning procedure and weight-decay as for the modular arithmetic\ntask. We held out 10, 20, or 30 randomly selected cases as test cases, and we repeated the random\nselection of the test cases \ufb01ve times. Table 2 shows the number of errors on the test cases when 4 \u00d7 4\nmatrices are learned for each person and for each relation. MRE generalizes much better than the\n\n\fTest results for the basic family trees task.\n\n        errors on 5 test sets      mean test error\n(10)    0    2    0    0    0           0.4\n(20)    6    0    0    0    0           1.2\n(30)    0    4    2    4    0           2.0\n\nTable 2: Test results on the basic family trees task. Each entry shows the number of errors on the\nrandomly held-out cases. There were no errors on the training set. The same randomly selected test\nsets were used for the 4 \u00d7 4 matrices. Each test query has 24 possible answers, of which at most 2\nobjects are considered correct. 
As there are 24 objects, random guessing is incorrect on at least 90%\nof the cases.\n\nfeedforward neural network used by [3] which typically gets one or two test cases wrong even when\nonly four test cases are held out. It also generalizes much better than all of the many variations\nof the learning algorithms used by [8] for the family trees task. These variations cannot achieve\nzero test errors even when only four test cases are held out and the cases are chosen to facilitate\ngeneralization.\n\n5 The higher-order modular arithmetic task\n\nWe used a version of the modular arithmetic task in which the only basic relations were\n{+0, +1, . . . , +11}, but we also included the higher-order relations plus, minus, inverse consisting\nof 36 propositions, examples of which are (3, +3) \u2208 plus; (3, +9) \u2208 minus; (+3, +9) \u2208 inverse.\nWe then held out all of the examples of one of the basic relations and trained 4 \u00d7 4 matrices on all\nof the other basic relations plus all of the higher-order relations.\n\nOur \ufb01rst attempt to demonstrate that MRE could generalize from higher-order relations to basic\nrelations failed: the generalization was only slightly better than chance. The failure was caused by\na counter-intuitive property of the discriminative objective function in Eq. 1 [9]. When learning the\nhigher-order training case (3, +3) \u2208 plus it is not necessary for the product of the matrix representing\n3 and the matrix representing plus to be exactly equal to the matrix representing +3. The product\nonly needs to be closer to +3 than to any of the other matrices. In cases like the one shown in \ufb01gure\n1(b), the relative probability of the point B under a Gaussian centered at RA is increased by moving\nRA up, because this lowers the unnormalized probabilities of C and D by a greater proportion than\nit lowers the unnormalized probability of B. 
The discriminative objective function prevents all of the\nrepresentations collapsing to the same point, but it does not force the matrix products to be exactly\nequal to the correct answer. As a result, the representation of +3 produced by the product of 3 and\nplus does not work properly when it is applied to a number.\n\nTo overcome this problem, we modi\ufb01ed the cost function for training the higher-order relations so\nthat it is minimized when \u02dcRA is exactly equal to B:\n\nC = \sum_{\tilde{R} \in \tilde{\mathcal{R}}} \sum_{(A,B) \in \tilde{R}} \|\tilde{R}A - B\|^2 \qquad (2)\n\nwhere \u02dcR ranges over \u02dcR, the set of all higher-order relations, and A and B can be either relations or\nbasic objects, depending on \u02dcR\u2019s domain.\nEven when using this non-discriminative cost function for training the higher-order relations, the\nmatrices could not all collapse to zero because the discriminative cost function was still being used\nfor training the basic relations. With this modi\ufb01cation, the training caused the product of 3 and plus\nto be very close to +3 and, as a result, there was often good generalization to basic relations even\nwhen all of the basic relations involving +3 were removed from MRE\u2019s training data and all it was\ntold about +3 was that (3, +3) \u2208 plus, (9, +3) \u2208 minus, and (+9, +3) \u2208 inverse (see table 3).\n\n\fTest results for the higher-order arithmetic task.\n\n           errors on 5 test sets      mean test error\n+1 (12)    5    0    0    0    0           1.0\n+4 (12)    0    1    6    6    0           2.6\n+6 (12)    0    0    4    4    6           2.8\n+10 (12)   3    7    0    0    8           3.6\n\nTable 3: Test results on the higher-order arithmetic task. Each row shows the number of incorrectly\nanswered queries involving a relation (i.e., +1, +4, +6, or +10) all of whose basic examples were\nremoved from MRE\u2019s training data, so MRE\u2019s knowledge of this relation was entirely from the\nother higher-order relations. 
Learning was performed 5 times starting from different initial random\nparameters. There were no errors on the training set for any of the runs. The number of test cases is\nwritten in brackets.\n\nTest results for the higher-order family trees task.\n\n                   errors on 5 test sets      mean test error\nhas father (12)    0   12    0    0    0           2.4\nhas aunt (8)       4    8    4    4    0           4.0\nhas sister (6)     2    0    0    0    0           0.4\nhas nephew (8)     0    0    8    0    0           1.6\n\nTable 4: Test results for the higher-order family trees task. In each row, all basic propositions\ninvolving a relation are held out (i.e., has father, has aunt, has sister, or has nephew). Each row\nshows the number of errors MRE makes on these held-out propositions on 5 different learning runs\nfrom different initial random parameters. The only information MRE has on these relations is in the\nform of a single higher-order relation, higher oppsex. There were no errors on the training sets for\nany of the runs. The number of held-out cases is written in brackets.\n\n6 The higher-order family trees task\n\nTo demonstrate that similar performance is obtained on the family trees task when higher-order relations\nare used, we included in addition to the 112 basic propositions the higher-order relation higher oppsex.\nTo de\ufb01ne higher oppsex we observe that many relations have natural male and natural female\nversions, as in: mother-father, nephew-niece, uncle-aunt, brother-sister, husband-wife, and son-daughter. We say that (A, B) \u2208 higher oppsex for relations A and B if A and B can be seen as\nnatural counterparts in this sense. Four of the twelve examples of higher oppsex are given below:\n\n1. (has father, has mother) \u2208 higher oppsex\n2. (has mother, has father) \u2208 higher oppsex\n3. (has brother, has sister) \u2208 higher oppsex\n4. 
(has sister, has brother) \u2208 higher oppsex\n\nWe performed an analogous test to that in the previous section on the higher-order modular arithmetic\ntask, using exactly the same learning procedure and learning parameters. For the results, see table 4.\n\nThe family trees task and its higher-order variant may appear dif\ufb01cult for systems such as MRE or\nLRE because of the logical nature of the task, which is made apparent by hard rules such as (A, B) \u2208\nhas father, (A, C) \u2208 has brother \u21d2 (C, B) \u2208 has father. However, MRE does not perform any explicit logical deduction based on explicitly inferred rules, as would be done in an Inductive Logic\nProgramming system (e.g., [7]). Instead, it \u201cprecomputes the answers\u201d to all queries during training,\nby \ufb01nding the matrix representation that models its training set. Once the representation is found,\nmany correct facts become \u201cself-evident\u201d and do not require explicit derivation. Humans may be\nusing a somewhat analogous mechanism (though not necessarily one with matrix multiplications),\nsince when mastering a new and complicated set of concepts, some humans start by relying heavily\non relatively explicit reasoning using the de\ufb01nitions. With experience, however, many nontrivial\ncorrect facts may become intuitive to such an extent that experts can make true conjectures whose\nexplicit derivation would be long and dif\ufb01cult. 
New theorems are easily discovered when the representations of all the concepts make the new theorem intuitive and self-evident.\n\n\fThe sequential higher-order arithmetic task.\n\n           errors on 5 test sets      mean test error\n+1 (12)    4    0    0    0    2           1.2\n+4 (12)    3   10    8    8    0           5.8\n+6 (12)    0    0    0    4    9           2.6\n+10 (12)  10    0    4    8    0           4.4\n\nThe sequential higher-order family trees task.\n\n                   errors on 5 test sets      mean test error\nhas father (12)    0    0   10    0    0           2.0\nhas aunt (8)       0    0    8    0    0           1.6\nhas sister (6)     0    0    0    0    0           0.0\nhas nephew (8)     0    0    0    0    0           0.0\n\nTable 5: Test results for the higher-order arithmetic task (top) and the higher-order family trees task\n(bottom) when a held-out basic relation is learned from higher-order propositions after the rest of the\nobjects and relations have been learned and \ufb01xed. There were no errors on the training propositions.\nEach entry shows the number of test errors, and the number of test cases is written in brackets.\n\nFigure 2: A neural network that is equivalent to Matrix Relational Embedding (see text for details).\n\nThis is analogous to the idea that humans can avoid a lot of explicit search when playing chess\nby \u201ccompiling\u201d the results of previous searches into a more complex evaluation function that uses\nfeatures which make the value of a position immediately obvious.\n\nThis does not mean that MRE can deal with general logical data of this kind, because MRE will fail\nwhen there are many relations that have many special cases. The special cases will prevent MRE\nfrom \ufb01nding low-dimensional matrices that \ufb01t the data well and cause it to generalize much more\npoorly.\n\n7 Adding knowledge incrementally\n\nThe previous section shows that MRE can learn to apply a basic relation correctly even though the\ntraining set only contains higher-order propositions about the relation. We now show that this can be\nachieved incrementally. 
After learning some objects, basic relations, and higher-order relations, we\nfreeze the weights in all of the matrices and learn the matrix for a new relation from a few higher-order propositions. Table 5 shows that this works about as well as learning all of the propositions at\nthe same time.\n\n8 An equivalent neural network\n\nConsider the neural network shown in Figure 2. The input vectors R and A represent a relation and\nan object using a one-of-N encoding. If the outgoing weights from the two active input units are\nset to R and A, these localist representations are converted into activity patterns in the \ufb01rst hidden\nlayer that represent the matrices R and A. The central part of the network consists of \u201csigma-pi\u201d\nunits [12], all of whose incoming and outgoing connections have \ufb01xed weights of 1. The sigma-pi\nunits perform a matrix multiplication by \ufb01rst taking the products of pairs of activities in the \ufb01rst\nhidden layer and then summing the appropriate subsets of these products. As a result, the activities\nin the next layer represent the matrix RA. The output layer uses a \u201csoftmax\u201d function to compute\nthe probability of each possible answer and we now show that if the weights and biases of the output\nunits are set correctly, this is equivalent to picking answers with a probability that is proportional to\ntheir probability density under a spherical Gaussian centered at RA. Consider a particular output\nunit that represents the answer B. 
If the weights into this unit are set to 2B and its bias is set to -\|B\|^2, the total input to this unit will be:\n\nTotal input = -\|B\|^2 + 2 \sum_{ij} (RA)_{ij} B_{ij} \qquad (3)\n\nThe probability that the softmax assigns to B will therefore be:\n\np(B|A, R) = \frac{e^{-\|B\|^2 + 2\sum_{ij}(RA)_{ij}B_{ij}}}{\sum_C e^{-\|C\|^2 + 2\sum_{ij}(RA)_{ij}C_{ij}}} = \frac{e^{-\|B\|^2 + 2\sum_{ij}(RA)_{ij}B_{ij} - \|RA\|^2}}{\sum_C e^{-\|C\|^2 + 2\sum_{ij}(RA)_{ij}C_{ij} - \|RA\|^2}} = \frac{e^{-\|RA - B\|^2}}{\sum_C e^{-\|RA - C\|^2}} \qquad (4)\n\nMaximizing the log probability p(B|A, R) is therefore equivalent to minimizing the cost function\ngiven in Eq. 1.\n\nThe fact that MRE generalizes much better than a standard feedforward neural network on the family\ntrees task is due to two features. First, it uses the same representational scheme (i.e., the same\nmatrices) for the inputs and the outputs, which the standard net does not; a similar representational\nscheme was used in [1] to accurately model natural language. Second, it uses \u201csigma-pi\u201d units that\nfacilitate multiplicative interactions between representations. It is always possible to approximate\nsuch interactions in a standard feedforward network, but it is often much better to build them into\nthe model [13, 5, 11].\n\nAcknowledgments\n\nWe would like to thank Alberto Paccanaro and Dafna Shahaf for helpful discussions. This research\nwas supported by NSERC and CFI. GEH holds a Canada Research Chair in Machine Learning and\nis a fellow of the Canadian Institute for Advanced Research.\n\nReferences\n\n[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137\u20131155, 2003.\n\n[2] L.A.A. Doumas, J.E. Hummel, and C.M. Sandhofer. A Theory of the Discovery and Predication of Relational Concepts. Psychological Review, 115(1):1, 2008.\n\n[3] G.E. Hinton. Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1\u201312, 1986.\n\n[4] J.E. Hummel and K.J. Holyoak. A Symbolic-Connectionist Theory of Relational Inference and Generalization. Psychological Review, 110(2):220\u2013264, 2003.\n\n[5] R. Memisevic and G.E. Hinton. Unsupervised learning of image transformations. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.\n\n[6] T.M. Mitchell. The need for biases in learning generalizations. Readings in Machine Learning. Morgan Kaufmann, 1991.\n\n[7] S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19(20):629\u2013679, 1994.\n\n[8] R.C. O\u2019Reilly. The LEABRA Model of Neural Interactions and Learning in the Neocortex. PhD thesis, Carnegie Mellon University, 1996.\n\n[9] A. Paccanaro. Learning Distributed Representations of Relational Data Using Linear Relational Embedding. PhD thesis, University of Toronto, 2002.\n\n[10] A. Paccanaro and G. Hinton. Learning Distributed Representations of Concepts using Linear Relational Embedding. IEEE Transactions on Knowledge and Data Engineering, 13(2):232\u2013245, 2001.\n\n[11] R.P.N. Rao and D.H. Ballard. Development of localized oriented receptive \ufb01elds by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2):219\u2013234, 1998.\n\n[12] D.E. Rumelhart, G.E. Hinton, and J.L. McClelland. A general framework for parallel distributed processing. MIT Press Computational Models Of Cognition And Perception Series, pages 45\u201376, 1986.\n\n[13] J.B. Tenenbaum and W.T. Freeman. Separating Style and Content with Bilinear Models. Neural Computation, 12(6):1247\u20131283, 2000.\n\n\f", "award": [], "sourceid": 259, "authors": [{"given_name": "Ilya", "family_name": "Sutskever", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}