
Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper analyzes the sample complexity of pairwise
interaction tensors and provides a scalable optimization algorithm based
on the SVT algorithm.
Overall, the paper is well written. The problem
is well motivated and the theorems are presented clearly. There are only a few
minor issues.
In line 234, $\delta_i$ is used without
introduction. The same issue appears in the statement of Theorem 4 in the
appendix, where $W$ is not explained.
In the temporal collaborative
filtering experiment, trace-norm-based algorithms usually work relatively
well on the MovieLens dataset. It is surprising to see that the algorithm "MC"
achieves only 0.92 test RMSE. It could be that the mean rating values were not
removed before running the MC algorithm; removing the mean is a standard
preprocessing step. The RPIT algorithm may still outperform MC, but not
by a significant margin, as the temporal effect in recommender systems is
usually relatively small (see the experiments in paper [1] for example).
[1] Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, Jaime G.
Carbonell. Temporal Collaborative Filtering with Bayesian Probabilistic
Tensor Factorization. SIAM Data Mining 2010.
Q2: Please summarize your review in 1-2
sentences
The paper provides a theoretical analysis of the sample
complexity of pairwise interaction tensors. The overall proof is clear,
and the algorithm for solving the problem is efficient.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a new tensor factorization model
and interesting applications. Each entry of the data tensor is expressed as a
sum of three scalar products. The authors propose a new formulation to
analyze the pairwise interaction tensor. The model is related to special
cases of BTD (block term decomposition), SOD (slice-oriented
decomposition) and KTD (Kronecker tensor decomposition). The link between
this specific model and existing, more general and flexible models, including
CPD, BTD, SOD, and Tucker models, should be discussed. The authors consider
only third-order tensors.
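For concreteness, the model as summarized above can be sketched as follows (a hypothetical illustration, assuming a pairwise interaction tensor of the form T[i, j, k] = A[i, j] + B[j, k] + C[k, i] with low-rank A, B, C; all names and sizes below are made up):

```python
import numpy as np

# Sketch of a pairwise interaction tensor (illustrative assumption:
# T[i, j, k] = A[i, j] + B[j, k] + C[k, i], with A, B, C low rank).
rng = np.random.default_rng(0)
n1, n2, n3, r = 5, 6, 7, 2

# Factoring each matrix as a product of thin matrices shows that every
# entry of T is a sum of three scalar (inner) products.
UA, VA = rng.standard_normal((n1, r)), rng.standard_normal((n2, r))
UB, VB = rng.standard_normal((n2, r)), rng.standard_normal((n3, r))
UC, VC = rng.standard_normal((n3, r)), rng.standard_normal((n1, r))
A, B, C = UA @ VA.T, UB @ VB.T, UC @ VC.T

# Assemble T[i, j, k] = A[i, j] + B[j, k] + C[k, i] by broadcasting.
T = A[:, :, None] + B[None, :, :] + C.T[:, None, :]
```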
Minor comments
- Line 69 needs an additional reference: B. Recht, M. Fazel, P. A. Parrilo,
"Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm
minimization," arXiv, 2007; SIAM Review 52(3), 471-501, 2010,
since Recht et al. appeared online in 2007, before Gross et al.,
2010.
- Proposition 1: this does not hold for arbitrary T. T should satisfy a condition, or
should be centered, because of the conditions 1^T A = 0, 1^T B = 0, and 1^T C
= 0.
- Lines 130-144: the discussion of coherence needs references.
- Line 234: delta_i is never defined in the paper.
- Line 269: why is the computation of shrink_A so different from those for B and C?

Experiments: If several components in A, B and C are highly collinear,
the tensor is hard to decompose, and we may not obtain the optimal
solution. This is an important issue that the authors do not address at
all in the simulation section. Note that simulations on generic random
factor matrices often look good in practice. The authors need to validate
their algorithms in difficult scenarios, not only in the normal cases.
Since the model can also be expressed in the form of CPD
(CANDECOMP/PARAFAC), most algorithms for CPD can also be employed.
The authors failed to compare the proposed algorithm with CPD
algorithms such as the fast damped Gauss-Newton algorithms and ALS with
line search:
A. H. Phan, P. Tichavský, and A. Cichocki, "Low
complexity damped Gauss-Newton algorithms for CANDECOMP/PARAFAC,"
SIAM J. Matrix Anal. Appl., vol. 34(1), pp. 126-147, 2013.
L. Sorber, M. Van Barel, and L. De Lathauwer, "Optimization-based
algorithms for tensor decompositions: Canonical polyadic
decomposition, decomposition in rank-(Lr,Lr,1) terms, and a new
generalization," Tech. Rep., University of Leuven, 2012.
An error bound for decomposition with missing data can be found in: P.
Tichavský, A. H. Phan, Z. Koldovský, "Cramér-Rao-Induced Bounds for
CANDECOMP/PARAFAC Tensor Decomposition," IEEE Transactions on Signal
Processing, vol. 61, no. 8, pp. 1986-1997, 2013.
Q2: Please summarize your review in 1-2
sentences
This is an interesting and very good paper with strong
theoretical results and convincing simulations. The theoretical results are
proved rigorously.
The authors demonstrated that, using their tensor
model and approach, they can obtain performance comparable with the best
state-of-the-art methods.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
(*) The paper provides algorithms with recovery
guarantees for the special problem of pairwise interaction tensor
completion. The algorithms use a convex optimization formulation.
Results are proven both for exact recovery, and for recovery in the
presence of noise. In addition, there are also experimental results on
simulated and real datasets, demonstrating the success of tensor
decomposition in predicting unobserved entries.
(*)
Overall, the paper is original, provides a significant contribution to the
hard problem of tensor decomposition, and looks technically impressive
(I could not verify all the proofs; they are quite involved). It is
also written quite clearly. The theoretical results showing robustness to
noise and the encouraging experimental results show that the authors'
approach has great potential.
(*) It is not clear how restrictive
the requirement to factor a tensor as a pairwise interaction tensor is
(compared to more general tensors). A better explanation of this
restriction is needed (e.g., dimension, number of parameters needed; how
much would one 'lose' if the true decomposition is more general, yet
one uses the pairwise interaction decomposition?).
(*) One
weakness is that the authors present all of their results for
three-dimensional tensors. This is fine for the sake of presentation, but it is
not clear how the authors' results generalize to higher-order
tensors. Is this all straightforward, or would it be hard to extend the
techniques used? A discussion of the applicability of the techniques
to higher-order tensors should be added.
(*) Another possible
concern about the bounds is that the reconstruction guarantees bound
the nuclear norm of the difference between the true tensor and the
reconstructed tensor (e.g., Theorem 2). It would have been better to
obtain a bound on the more natural L2 norm (mean-squared error). Can this be
deduced with the methods used by the authors?
(*) The authors
discuss the computational complexity of their algorithm (per
iteration) on page 6, lines 292-304. It would also be good to include the
memory requirements (large high-dimensional tensors may be too big to hold
in memory; does the entire tensor need to be stored, or only a
small amount of data at a time?) and the number of iterations required to
converge (i. are there theoretical guarantees on the error of the convex
program as a function of the number of iterations? ii. how many
iterations were required for the experimental examples?).
(*) It
looks like there are many implementation details which are important for
the scalability of the authors' algorithm. It is nice that the
authors give more details of the algorithms in the appendix, but it
would be better to also make the authors' software implementation
available to the community.
(*) The authors should provide a
reference for the MovieLens dataset used. Where was it downloaded from
(URL? a citation?), or alternatively, make the data available.
(*) The description of the experimental results (Figure 1) looks
wrong. In the text (lines 319-320) it is said that the y-axis is the ratio
m/d, but this seems to be the x-axis (it is hard to tell; the axis labels
are too small). What is the y-axis? The rank r? (Also too small.)
Also, it is not clear to me why the recovery error is stable as the number
of observations is increased (Table 1(b)). I would expect the error
to go to zero as more and more measurements are added (even if these
are noisy measurements). In Figure 2, what are the units? Is the RMSE
somehow normalized to 1 (or is that just a coincidence)? Were the movie
ratings normalized somehow?
(*) I don't understand Theorem 2
(robustness to noise). It is stated that if the measurement error is
\epsilon_1, and one solves an optimization problem allowing error
\epsilon_2, then the overall reconstruction error is bounded (in
probability) by \epsilon_1 + \epsilon_2. If this is true, why wouldn't
one want to decrease \epsilon_2 as much as possible (even take
\epsilon_2 = 0)? This would reduce the reconstruction error! However,
intuitively, it seems to me that under measurement noise one wants to
allow some noise \epsilon_2 and not take it to zero; otherwise one may
increase the dimension, or fail to satisfy the constraints. So how is
this consistent with the bound given by the authors?
(*) Some of
the proof steps are unclear. For example, on page 7, lines 335-341: why
and how did the authors go from 26\mu_0 r / n_1 n_2 to an expression
also involving division by n_1 n_3, and then back to 28\mu_0 r / n_1 n_2?
Minor:
Page 2, line 88: 'show a quadratically ..' -> 'show that a quadratically ..'
Page 3, line 146: 'sampled from' -> 'sampled'
Page 4, line 206: 'recover algorithm' -> 'recovery algorithm'
Page 5, line 238: 'be the adjoint' -> 'the adjoint'
Q2: Please summarize your review in 1-2
sentences
Overall, the paper is original, provides a significant
contribution to the hard problem of tensor decomposition, and looks
technically impressive (I could not verify all the proofs; they are
quite involved). It is also written quite clearly. The theoretical results
showing robustness to noise and the encouraging experimental results
show that the authors' approach has great potential.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Thank you for your comments.
To Reviewer 1:
1. $\delta_i$ is the Kronecker delta. The definition of $W$ should
be $\mathrm{diag}(U_AV_A^T, U_BV_B^T, U_CV_C^T)$ (a similar variable $W^{\perp}$ is
defined on [line 232, supplementary]). We will add the definitions.
2. In theory, a matrix completion (MC) algorithm does not require
the mean of the input matrix to be removed, since it is expected to 'recover'
the matrix whether or not the input matrix is centered.
Similarly, in a quantitative study of matrix completion algorithms,
[Keshavan et al. 2009] used the raw ratings of MovieLens 100K
(note: we used MovieLens 1M) directly, without removing the mean value, as the
input to several matrix completion algorithms. In addition, the
input matrix for the compared RPIT algorithm is also not centered, which
ensures that the comparison of RPIT and MC is fair in the sense that they
have the same input.
[Keshavan et al. 2009]: Keshavan, Raghunandan
H., Andrea Montanari, and Sewoong Oh. "Low-rank matrix completion with
noisy observations: a quantitative comparison." Allerton 2009.
To Reviewer 2:
- About the citation on line 69: we cited the work of Gross et al., 2010 [8]
since their result and proof technique are the most closely related. The
recovery result of Recht et al., 2007, on the other hand, requires the
observation operator to satisfy the r-RIP (restricted isometry property),
which in general does not hold for the observation operators we deal with
in our scenario.

- Proposition 1 indeed holds for any pairwise interaction tensor T. In fact,
Proposition 1 shows that for any T = Pair(A,B,C), one can transform it to a
(unique) ``centered'' version Pair(A,B,C) = T = Pair(A',B',C'), where A', B', C'
satisfy the centering conditions (lines 121-125). The transformation
method is given in the proof of Proposition 1 [Section 4, supplementary].
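To illustrate why centering is needed for uniqueness, the shift ambiguity can be checked numerically; a minimal sketch, assuming Pair(A,B,C)[i, j, k] = A[i, j] + B[j, k] + C[k, i] (all sizes and names are illustrative):

```python
import numpy as np

def pair(A, B, C):
    # Pair(A, B, C)[i, j, k] = A[i, j] + B[j, k] + C[k, i]
    # Shapes: A is (n1, n2), B is (n2, n3), C is (n3, n1).
    return A[:, :, None] + B[None, :, :] + C.T[:, None, :]

rng = np.random.default_rng(1)
n1, n2, n3 = 4, 5, 6
A = rng.standard_normal((n1, n2))
B = rng.standard_normal((n2, n3))
C = rng.standard_normal((n3, n1))

# Without centering, the representation is not unique: moving an
# arbitrary per-column shift m from A into B leaves the tensor unchanged.
m = rng.standard_normal(n2)
T1 = pair(A, B, C)
T2 = pair(A - np.outer(np.ones(n1), m), B + np.outer(m, np.ones(n3)), C)
```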
- We will add a reference for coherence (e.g., Candès and Recht,
2009 [5]). Thanks for pointing this out.
- $\delta_i$ is the Kronecker delta. We will add the definition.
- The differences among the
computation methods of shrink_A, shrink_B and shrink_C are inherited from the
definitions of S_A, S_B and S_C, which are slightly different (the
definitions are at lines 123-125).
- On the concern ``If several
components in A, B and C are highly collinear, the tensor is hard to
decompose, and we may not obtain the optimal solution'': 1. Our main
results hold whether or not A, B and C are collinear. 2. In
theory, the optimization objectives (Eq. 3 and Eq. 5) are convex, and
therefore one can always find the optimal solution for any input.
3. In practice, we traded optimality for scalability in our recovery
algorithm [lines 202-209] and solved the relaxed version (Eq. 7) of the
original objective (Eq. 3). Note that the difference between the relaxed
objective (Eq. 7) and (Eq. 3) is determined by A+B+C, not by
the correlations or coherences among A, B and C. Therefore, the quality of
the reconstruction is not related to the coherences or correlations
among A, B and C. The case of noisy recovery is similar.
To
Reviewer 3:
(*) Due to space limitations, we excluded a discussion of
how a pairwise interaction tensor differs from a general tensor in terms
of degrees of freedom. In fact, the number of degrees of freedom of a
``low-rank'' pairwise interaction tensor (i.e., A, B, C are low-rank
matrices) is comparable to that of a low-rank (in tensor rank) tensor.
We will try to add a brief discussion within the space limitations.
(*) We believe that there
are no significant technical gaps in generalizing our theory to higher-order
tensors. We did not pursue that direction in this paper in order to avoid
complicated notation.
(*) Since the nuclear norm dominates the
Frobenius norm, our error bound implies an error bound in terms of the
Frobenius norm.
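This domination is easy to verify numerically (an illustration only; the matrix below is arbitrary):

```python
import numpy as np

# The nuclear norm is the l1 norm of the singular values, while the
# Frobenius norm is their l2 norm, so ||X||_F <= ||X||_* for any matrix X.
rng = np.random.default_rng(2)
X = rng.standard_normal((8, 5))

s = np.linalg.svd(X, compute_uv=False)
fro = np.linalg.norm(X, ord='fro')
nuc = np.linalg.norm(X, ord='nuc')
```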
(*) We mentioned the storage costs in lines
300-301. Indeed, we only need to store 1) all observed entries of the
tensor and 2) the three low-rank matrices X, Y, Z.
(*) We will include a
reference to the MovieLens dataset (which is a public dataset).
(*)
This is a typo: it should be the x-axis rather than the y-axis. The y-axis is
the rank r. We will adjust the font sizes in the figures. In the noisy case,
in general, the error cannot be _zero_ even if all entries are observed
with noise, since it is impossible to _exactly_ recover the true
matrix/tensor (for example, an adversary can always add a small-energy,
low-rank perturbation to A, B, C). More specifically, Candès and Plan, 2010
[Section III.B and Section IV, 4] discussed a related case in the
context of noisy matrix completion. They showed that the optimal bound, which
is achievable with oracle knowledge, is roughly p^{1/2}\delta, where p is
the fraction of observed entries and \delta is the noise energy on the
observations (in their notation). Therefore, even if all entries are
observed (p=1), the optimal error bound is not zero. The RMSE is not
normalized; it is just a coincidence that the largest RMSE is around 1.0.
(*) A requirement is missing from the statement of Theorem 2:
\epsilon_2 \ge \epsilon_1. We indeed require \epsilon_2 \ge \epsilon_1 in
our argument; otherwise, the true solution (A,B,C) itself would not be a
feasible solution to Eq. 5 (if one does not allow error, the true
solution itself would not match the noisy observations). Our proof of
Theorem 2 already uses this condition and the fact that the true solution is
feasible (line 756, supplementary: \hat M \le M). We will add this
missing requirement to the final version of the paper. We note that such a
requirement is also needed in the noisy matrix completion literature [4,8].
(*) This is a typo; 26\mu_0 r/n_1 n_2 should be removed.
 