
Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a simple and scalable approach to
modeling multirelational data using lowdimensional vector embeddings
of entities, with the relationships between embeddings captured using
offset vectors. The embeddings are learned by training a marginbased
ranking model to score the observed entity1,relationship,entity2
triples higher than the unobserved ones.
Though the proposed model
can be seen as a special case of several existing models (e.g. [1],
[2], [3]), it seems to work considerably better than them despite (or
most likely because of) being considerably simpler. The approach is
well motivated and clearly described. The empirical evaluation is
reasonably well done, but the write up could be better. The paper
really should report the hyperparameter values that produced the best
performing models along with the corresponding training times and the
parameter counts. It is also unclear whether L1 or L2 distances were
used in the experiments.
Unfortunately, the authors do not provide
a clear explanation of the fact that despite being less expressive
than several of the models it is compared to, TransE outperforms them.
It is important to understand whether this is due to overfitting or
underfitting, as answering this question might lead to better results.
Reporting model performance on a subset of training data would be
quite helpful for this. If the more expressive models underfit, TransE
can be used to initialize them better, and if they overfit, TransE can
be used to regularize them in an interesting way.
It might be
clearer to define the margin loss in terms of scores instead of
energies, as is more common in the ranking literature.
Why do
the numbers for SME do not match those from [2] on WordNet? Was a
different setup used?
Why was the Neural Tensor Model not
included in the experimental evaluation?
How important is it to
constrain the L2 norm of the embeddings of entities to be 1?
Typos: Line 3 of Algorithm 1: j => k Table 1 caption: d
=> k Table 2: 10^9 => 10^6 Line 183:
learn=>learning Q2: Please summarize your review in
12 sentences
A well executed paper that introduces a simple and
effective method for modeling multirelational data.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper describes a simple model denoted TransE for
learning embeddings of entities and dyadic relationships between pairs of
entities. The TransE model is an important special case of other models
that already exist in the literature. The simplifying assumptions of the
model enable more effective and rapid training which in turn yield
improved results on the two datasets used in this paper over alternative
models from the literature. The model operates on triplets (h,l,t) where h
is the "head" entity, "l" indicates the type of relation, and "t" is the
tail entity. The energy assigned to a triplet (where vect(h) indicates the
embedding vector for item h learned by the model) is d(vect(h) + vect(l),
vect(t)), where d(.) is either the L1 or L2 distance. The paper describes
results on predicting links in two knowledge bases: WordNet and a subset
of Freebase. On both datasets TransE substnatially outperforms all of the
baselines.
The paper is clear and the organization is
straightforward. The choice of the filtered evaluation metric makes sense
and the baselines are reasonable. However, I would have appreciated a few
more details on the experimental results since the demonstration of the
success of TransE is the important contribution of the paper. For example,
whether L1 or L2 distance worked better should be included. Since the
model is so simple, a more extensive exploration of its failure modes
would be valuable. Are there any patterns in the relations or entities it
fails to perform well on? Presenting some clearly erroneous examples along
with hypotheses about why the model failed to handle those cases would be
quite interesting.
What effect does the norm constraint on the
embedding vectors have? Do any of the baseline models use a similar
constraint? Some alternative models certainly don't use such a constraint
and I wonder how important the constraint is to the success of the TransE
training procedure and model. A sentence or two on this issue would
improve the paper.
Small corrections: The last line of page six
should have "ManyToMany" instead of
"1ToMany." Q2: Please summarize your review in 12
sentences
This paper is clear and provides a useful empirical
examination of an appealing multirelational embedding learning model that
is closely related to other models in the literature. The experiments in
the paper justify giving the particular model more attention from the
community. Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a model of multirelational data,
such as the set of facts of the form (subject, predicate, object) that
comprise the Freebase database. Perhaps the most common use case of such a
model is the following: given a triple with either the subject or object
missing, rank all possible completions of that triple. The proposed model
is energybased, and thus assigns a score to each possible completion. The
lower the score, the more likely that triple represents a true fact that
is typically not already present in the database. In addition to a
description of the model, the paper presents a training algorithm and
reports largescale experimental results on Freebase and other datasets on
the aforementioned completion task.
The essense of the model is to
embed each entity and relation into a continuous vector space. The
dimensions of this space have no direct interpretable meaning except that
entities which are semantically similar (in the sense of participating in
the same set of facts) will be nearby in the latent space (according to L1
or L2), as is standard in matrix factorization. The latent representation
of the tail of a triple is modeled as a translation of the latent
representation of the head of the triple, with a different learned
translation vector for each relation. The quality of a proposed latent
embedding is given by an objective function which is essentially the total
score of all observed true facts minus the total score of all triples not
known to be true, plus a margin term. The parameters of the model (a
latent vector for each entity and relation) are learned via a
straightforward stochastic gradient descent algorithm from a training set
of known facts.
Many similar latentembedding models of
multirelational data exist in the literature. The main novelty in this
paper is the restriction that the embedding of tail of a triple be a
relationshipdependent translation of the embedding of the subject, rather
than an arbitrary linear transformation. I believe this boils down to the
assumption that the contribution of each of the features represented by
each dimension of the latent space to the energy function are independent
of each other, for a given relationship. As the authors points out, this
is strictly a special case of of the 'Structured Embeddings' model. SE
embeds each relationship into a matrix rather than a vector, allowing for
the possiblity that the transformation represented by the relationship
between head and tail embeddings is an arbitrary linear transformation.
The experimental results use the established metrics for this kind
of model and show impressive stateoftheart performance on standard
benchmarks. It seems that the expressivity reduction of this model allows
for more efficient training and more resilience to overfitting. As these
benchmarks are standard, I have no serious complaint with the methodology
here. Some minor comments: It would have been interesting to also see
results for a task where a missing relationship has to be predicted given
a head and tail. The 'filtered' dataset concept, while a good innovation
over the 'raw' dataset, seems superflous to this paper given how close the
results are between them and since it was not even used for the FB1M
dataset, which is the largest and perhaps most important dataset the
authors tested on. The authors claim all experimental results are
significant (line 302), but do not name the statistical test being used or
justify the assumptions behind the test. They also do not say how the
latent dimensionality constant was chosen (268).
My main concern
with this paper is that is fairly light on novelty and is rather more
incremental than a toptier conference demands. The model is a simple
restriction of Structued Embedding, an existing, wellknown model:
relationship embeddings are replaced with vectors instead of matrices, and
so encode translations in latent space rather than arbitrary linear
transformations. The learning algorithm is entirely standard stochastic
gradient descent. The empirical results are standard benchmarks on
standard datasets. While the restriction is shown to be benefial on these
tasks, there is somewhat of a lack of indepth analysis of why. Vague
claims in lines 7193 and on 304 ('appropriate design of the model
according to the data') are made that this model is wellsuited to the
data, but the experiments do not make it clear what is actually helping
over SE: Is it that the model is more efficiently trainable? Is it the
margin term in the optimization function? Is it that it inherently
generalizes better? What exactly is it about these datasets that make
translation a good model of relations? The authors present each of these
as hypotheses but do not elaborate. As it stands, I feel this paper is
presenting a useful idea which could become the core of a good paper, but
it needs a deeper analysis of why and when this restriction works.
Some minor comments: * The handling of negative data and the role
the closedworld assumption is making to the training procedure was
somewhat unclear to me in 134148. What exactly is being assumed about the
truth of facts missing from the database? * On 195, the authors state that
experiments with the Neural Tensor Model could not be run, but do not give
a reason. * On 236, they state that a corrupted triple 'would' be ranked
about the test triplet if the corrupted triple is in fact true, but I
would think it may or may not be. * The paper only has a single figure,
which comes at the end and displays results. I think a figure earlier in
the paper presenting a graphical depiction of the model and its output
would be useful.
Q2: Please summarize your review in
12 sentences
The authors present a model and technically sound
training algorithm for multirelational data that achieves
stateoftheart performance on accepted benchmarks. However, the model is
a small variation of an existing established model and the training
algorithm is standard, calling into question the novelty of the work.
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes a nice method for relation
classification using simple vector additions.
The method is very
simple and a special case of previous models. But it was shown that
this special case works well even with few parameters.
Added after
rebuttal: Upped score to 6. The data used from [2] is still very
problematic since it includes test triplets (x,Relation,y) for many
symmetric relationships such as is_similar and with (y,Relation,x) in the
training set. I believe that is one of the reasons the simple translation
model performs so well. It would be good to exclude these many simple test
cases. But since it's hard to verify whether that's the main reason, I
upped to score to 6. Q2: Please summarize your review in
12 sentences
This paper describes a nice and simple method for
relation classification using simple vector additions.
The
evaluation is ok, some more comparisons to recent work on tensor
factorization would have made the paper better.
It seems like this
model would work well for relations with very little training data (since
the model cannot overfit so easily) but worse for datasets with more
triplets to train with. An analysis of this would be nice.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their very detailed and
helpful comments. We try to address them below, with a particular focus on
detailing the novelty and behavior of the model.
**
Novelty/interest ** The main contribution (hence novelty) of our paper
consists in the brand new model formulation, which fits nicely the KBs. We
believe that its very interesting properties could be of great use for
people in the NIPS community interested in modeling/using multirelational
data:  It is easy to train (less underfitting see below) and to
implement. This seems crucial on this problem for which optimization can
not be properly conducted.  It involves less parameters (more
compact).  It performs well on realworld KB data.  It can be
limited to more sophisticated multirelational data (see discussion below)
but, as pointed by R4, it could then serve as initialization for more
complex systems, and potentially reduce their underfitting.
**
Why does the model work well? **  R4/R6: "Reporting model performance
on a subset of training data would be quite helpful for this" "it that the
model is more efficiently trainable? Is it the margin term in the
optimization function? Is it that it inherently generalizes better?”
Here are some train results on FB15k (mean rank/hits@10): 
Unstructured : 938 / 5.3%  SE : 165 / 35.5%  SME(linear) : 151 /
36.4%  SME(bilinear) : 155/ 34.2%  TransE : 127 / 42.7%
This shows that TransE gets the best results on train (as on
test), indicating that this model is actually better trained, whereas the
other, more expressive, approaches like SE are underfitting. This is not
due to the margin term since all above methods also use a margin in
training, but to the simple architecture of TransE.
 R7: “It
seems like this model would work well for relations with very little
training data (since the model cannot overfit so easily) but worse for
datasets with more triplets to train with.”
Our model is unlikely
to overfit (less parameters), but as we showed above, it is also less
subject to underfitting. So, it should work well in both case, large data
sets and small data sets. The experiment of Section 4.4 confirms that.
** Failure modes ** R5/R6:"a more extensive exploration of its
failure modes would be valuable. Are there any patterns in the relations
or entities it fails to perform well on." " What exactly is it about these
datasets that make translation a good model of relations?"
The
translationbased formulation of TransE can be seen as encoding a series
of twoway interactions (e.g. by developing the L2 version). This seems
satisfactory to model data from generic KBs, since they do not involve too
many relationships with complex ternary interactions. However, for other
datasets, for which one must take into account both the 3way dependencies
between h, l and t, TransE should fail. This should be the case for the
Kinships dataset used in [1,6,10] for instance ([1] showed that 3way
interactions are crucial there). Our submission shows that for modeling
generic largescale KB, one should rather model properly the most frequent
connectivity patterns, which can be done with 2way interactions only.
** Constraining the L2 norm of the embeddings of entities to
be 1** R4/R5: "How important is it to constrain the L2 norm of the
embeddings of entities to be 1?" "Do any of the baseline models use a
similar constraint?"
This constraint is important for our model,
as it is for embedding methods learning a similarity function (SE, SME,
LFM). This prevents the training to artificially satisfy the ranking
criterion by blowing up entity embeddings norms. RESCAL (based on
reconstruction) does not have this constraint.
** Other points
**  R4: "The paper really should report the hyperparameter values
that produced the best performing models along with the corresponding
training times and the parameter counts. It is also unclear whether L1 or
L2 distances were used in the experiments."
On WN: k=20, learning
rate=0.01, similarity=L1 distance, margin=2. On FB15k: k=50, learning
rate=0.01, similarity=L1 distance, margin=1. On FB1M: k=50, learning
rate=0.01, similarity=L2 distance, margin=1.
Training times were
not provided because experiments have been conducted on different
computers, making time comparison quite meaningless.
 R5:
"Why do the numbers for SME not match those from [2] on WordNet? Was a
different setup used?"
The authors of [2] published an erratum on
their website
(https://www.hds.utc.fr/~bordesan/dokuwiki/doku.php?id=en:aaai11_erratum)
, because they reported some bugs in their data sets (overlap between
train and test sets). We used the fixed WN data version they propose and
our numbers correspond to those of the erratum.
 R5: "Why was
the Neural Tensor Model not included in the experimental evaluation?"
The Neural Tensor Model has been published recently (ICLR'13 in
May), shortly before the NIPS deadline. Besides, the code is not yet
available, to the best of our knowledge.
 R6:"They also do
not say how the latent dimensionality constant was chosen (268)."
We agree that the corresponding sentence is unclear. The dimension
was chosen among {20, 50} on a validation set.
 R7:”The
evaluation is ok, some more comparisons to recent work on tensor
factorization would have made the paper better.”
We compare with
six recent methods that have shown to reach stateoftheart performances
on various benchmarks in the last 2 years. This provides a wide coverage
of what has been proposed recently. For us, the only missing method could
be the Neural Tensor Model, for which we could not run the experiments
(see reasons above).
 