Paper ID: 856
Title: Exact and Stable Recovery of Pairwise Interaction Tensors
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper analyzes the sample complexity of recovering pairwise interaction tensors and provides a scalable optimization algorithm based on the SVT algorithm.
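For context, a minimal sketch of the classical SVT iteration for matrix completion is given below (the function name and parameter values are illustrative only; the paper's algorithm adapts this shrinkage template to the factor matrices of a pairwise interaction tensor, as the responses below discuss).

```python
import numpy as np

# Minimal sketch of classical SVT for matrix completion (illustrative only).
# M holds observed values (float), Omega is a boolean mask of observed entries.
def svt_complete(M, Omega, tau=5.0, delta=1.2, iters=200, tol=1e-4):
    Y = np.zeros(M.shape)                            # dual variable
    X = np.zeros(M.shape)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt      # singular value shrinkage
        resid = np.where(Omega, M - X, 0.0)          # error on observed entries
        if np.linalg.norm(resid) <= tol * np.linalg.norm(np.where(Omega, M, 0.0)):
            break
        Y += delta * resid                           # gradient step on the dual
    return X
```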

Overall, the paper is well written. The problem is well motivated and the theorems are presented clearly. There are only a few minor issues.

In line 234, $\delta_i$ is used without introduction. The same issue appears in the statement of Theorem 4 in the appendix, where $W$ is not explained.

In the temporal collaborative filtering experiment, trace-norm-based algorithms usually work relatively well on MovieLens data. It is surprising that the algorithm "MC" achieves only 0.92 test RMSE. It could be that the mean rating values were not removed before running the MC algorithm; removing mean values is a standard preprocessing step. The RPIT algorithm may still outperform MC, but not significantly, as the temporal effect in recommender systems is usually relatively small (see the experiments in paper [1] for example).
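For reference, the preprocessing the reviewer describes typically amounts to one centering step before the solver and one de-centering step after; `run_mc` below is a hypothetical placeholder for any matrix completion routine.

```python
import numpy as np

# Standard mean-removal preprocessing for matrix completion (sketch).
# R: rating matrix; Omega: boolean mask of observed entries;
# run_mc: hypothetical placeholder for any MC solver.
def complete_with_centering(R, Omega, run_mc):
    mu = R[Omega].mean()                       # mean of observed ratings only
    R_centered = np.where(Omega, R - mu, 0.0)  # subtract mean on observed entries
    X = run_mc(R_centered, Omega)              # run the MC algorithm
    return X + mu                              # add the mean back to predictions
```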

[1] Liang Xiong, Xi Chen, Tzu-kuo Huang, Jeff Schneider, Jaime G. Carbonell. Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. SIAM Data Mining 2010.
Q2: Please summarize your review in 1-2 sentences
The paper provides a theoretical analysis of the sample complexity of pairwise interaction tensors. The overall proof is clear and the algorithm for solving the problem is efficient.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a new tensor factorization model and interesting applications. Each entry of the data tensor is expressed as a sum of three scalar products. The authors propose a new formulation to analyze the pairwise interaction tensor. The model is related to special cases of BTD (block term decomposition), SOD (slice-oriented decomposition), and KTD (Kronecker tensor decomposition). The link between this specific model and existing, more general and flexible models, including CPD, BTD, SOD, and Tucker models, should be discussed. The authors consider only third-order tensors.
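To make the model concrete, here is a minimal sketch assuming the entrywise form $T_{ijk} = A_{ij} + B_{jk} + C_{ki}$ suggested by the Pair(A,B,C) notation in the author feedback; once A, B, C are written in low-rank factored form (e.g., $A_{ij} = \langle u_i, v_j \rangle$), each entry is indeed a sum of three scalar products.

```python
import numpy as np

# Sketch of a pairwise interaction tensor, assuming the entrywise form
# T[i, j, k] = A[i, j] + B[j, k] + C[k, i] with low-rank A, B, C.
def pair(A, B, C):
    return A[:, :, None] + B[None, :, :] + C.T[:, None, :]

rng = np.random.default_rng(0)
n1, n2, n3, r = 6, 5, 4, 2
A = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r
B = rng.standard_normal((n2, r)) @ rng.standard_normal((r, n3))  # rank-r
C = rng.standard_normal((n3, r)) @ rng.standard_normal((r, n1))  # rank-r
T = pair(A, B, C)                                                # shape (6, 5, 4)
```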

Minor comments

- Line 69 needs an additional reference:
B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, 52(3), 471-501, 2010 (available on arXiv since 2007),

since Recht et al. appeared online in 2007, before Gross et al., 2010.
- Proposition 1: this should not hold for arbitrary T. T should satisfy some condition, or be centered, because of the conditions
1^T A = 0, 1^T B = 0, and 1^T C = 0.

- Lines 130-144: the discussion of coherence needs references.

- Line 234, $\delta_i$ is never defined in the paper.

- Line 269, why does the computation of shrink_A differ so much from those for shrink_B and shrink_C?

- Experiments
If several components in A, B, and C are highly collinear, the tensor is hard to decompose, and we may not obtain the optimal solution.
This is an important issue that the authors do not address at all in the simulation section.
Note that simulations on general random factor matrices often look good in practice.
The authors need to test their algorithms on difficult scenarios, not only on the normal cases (see the sketch below).
Moreover, the model can also be expressed in the form of a CPD (CANDECOMP/PARAFAC), so most algorithms for CPD can also be employed.
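A minimal sketch of one such difficult scenario (the construction and names are illustrative): factor matrices whose columns all share one dominant direction, making them highly collinear.

```python
import numpy as np

# Generate a factor matrix with highly collinear columns: each column mixes
# one shared direction with a small independent part, so pairwise column
# correlations are about rho**2 (close to 1 when rho is near 1).
def collinear_factor(n, r, rho, rng):
    shared = rng.standard_normal((n, 1))
    indep = rng.standard_normal((n, r))
    return rho * shared + np.sqrt(1.0 - rho**2) * indep

rng = np.random.default_rng(0)
A = collinear_factor(50, 5, 0.99, rng)  # a known hard case for CP-type methods
```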

The authors did not compare the proposed algorithm with CPD algorithms such as
the fast damped Gauss-Newton algorithm or ALS with line search:

A.-H. Phan, P. Tichavský, and A. Cichocki, "Low complexity damped Gauss-Newton algorithms for CANDECOMP/PARAFAC," SIAM J. Matrix Anal. Appl., vol. 34(1), pp. 126–147, 2013.

L. Sorber, M. Van Barel, and L. De Lathauwer, "Optimization-based algorithms for tensor decompositions: canonical polyadic decomposition, decomposition in rank-(L_r,L_r,1) terms and a new generalization," Tech. Rep., University of Leuven, 2012.

The error bound for decomposition with missing data can be found in:
P. Tichavský, A. H. Phan, and Z. Koldovský, "Cramér-Rao-induced bounds for CANDECOMP/PARAFAC tensor decomposition," IEEE Transactions on Signal Processing, vol. 61, no. 8, pp. 1986-1997, 2013.
Q2: Please summarize your review in 1-2 sentences
This is an interesting and very good paper with strong theoretical results and convincing simulations. The theoretical results are proved rigorously.

The authors demonstrated that, using their tensor model and approach, they can obtain performance comparable with the best state-of-the-art methods.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
(*) The paper provides algorithms with recovery guarantees for the special problem of pairwise interaction tensor completion.
The algorithms use a convex optimization formulation. Results are proven both for exact recovery and for recovery in the presence of noise.
In addition, there are experimental results on simulated and real datasets, demonstrating the success of tensor decomposition in
predicting unobserved entries.


(*) Overall, the paper is original, provides a significant contribution to the hard problem of tensor decomposition, and looks technically impressive
(I could not verify all the proofs; they are quite involved). It is also written quite clearly. The theoretical results showing robustness to noise
and the encouraging experimental results suggest that the authors' approach has great potential.

(*) It is not clear how restrictive the requirement to factor a tensor as a pairwise interaction tensor is (compared to more general tensors).
A better explanation of this restriction is needed (e.g., dimension, number of parameters needed, and how much one would 'lose' if the true decomposition
is more general, yet one uses the pairwise interaction decomposition).

(*) One weakness is that the authors present all of their results for three-dimensional tensors. This is fine for the sake of presentation, but it is
not clear how the authors' results generalize to higher-order tensors. Is this all straightforward, or would it be hard to extend the techniques used?
A discussion of the applicability of the techniques to higher-order tensors should be added.

(*) Another possible concern about the bounds is that the reconstruction guarantees bound the nuclear norm of the difference between the true tensor and the reconstructed
tensor (e.g., Theorem 2). It would have been better to obtain a bound on a more natural L2 norm (mean-squared error). Can this be deduced from the methods used by the authors?

(*) The authors discuss the per-iteration computational complexity of their algorithm on page 6, lines 292-304. It would also be good to include
the memory requirements (large high-dimensional tensors may be too big to hold in memory; does the entire tensor need to be stored, or only a small amount of data at a time?)
and the number of iterations required to converge (i. are there theoretical guarantees on the error of the convex program as a function of the number of iterations?
ii. How many iterations were required for the experimental examples?).

(*) It looks like there are many implementation details that are important for the scalability of the authors' algorithm. It is nice that the authors
give more details of the algorithms in the appendix, but it would be better to also make the authors' software implementation available to the community.


(*) The authors should provide a reference for the MovieLens dataset used. Where was it downloaded from (a URL or a reference)? Alternatively,
make the data available.

(*) The description of the experimental results (Figure 1) looks wrong. The text (lines 319-320) says that the y-axis is the ratio m/d,
but this seems to be the x-axis (hard to tell; the axis labels are too small). What is the y-axis? The rank r? (Also too small.)
Also, it is not clear to me why the recovery error stays stable as the number of observations is increased (Table 1 (b)). I would expect the error
to go to zero as more and more measurements are added (even if these are noisy measurements).
In Figure 2, what are the units? Is the RMSE somehow normalized to 1, or is that just a coincidence? Were the movie ratings normalized somehow?

(*) I don't understand Theorem 2 (robustness to noise). It is stated that if the measurement error is $\epsilon_1$, and one solves an optimization problem
allowing error $\epsilon_2$, then the overall reconstruction error is bounded (in probability) by $\epsilon_1+\epsilon_2$. If this is true, why wouldn't
one want to decrease $\epsilon_2$ as much as possible (even take $\epsilon_2=0$)? This would reduce the reconstruction error! However, intuitively, it seems to
me that under measurement noise one wants to allow some noise $\epsilon_2$ and not take it to zero; otherwise one may increase the dimension, or fail to satisfy
the constraints. So how is this consistent with the bound given by the authors?

(*) Some of the proof steps are unclear. For example, on page 7, lines 335-341: why and how did the authors go from $26\mu_0 r/n_1n_2$
to an expression also involving division by $n_1n_3$, and then back to $28\mu_0 r/n_1n_2$?

Minor:
Page 2, line 88: 'show a quadratically ..' -> 'show that a quadratically ..'

Page 3, line 146: 'sampled from' -> 'sampled'

Page 4, line 206: 'recover algorithm' -> 'recovery algorithm'

Page 5, line 238: 'be the adjoint' -> 'the adjoint'
Q2: Please summarize your review in 1-2 sentences
Overall, the paper is original, provides a significant contribution to the hard problem of tensor decomposition, and looks technically impressive
(I could not verify all the proofs; they are quite involved). It is also written quite clearly. The theoretical results showing robustness to noise
and the encouraging experimental results suggest that the authors' approach has great potential.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thank you for your comments.

To Reviewer 1:

1. $\delta_i$ is the Kronecker delta. The definition of $W$ should be $\mathrm{diag}(U_AV_A^T, U_BV_B^T, U_CV_C^T)$ (a similar variable $W^{\perp}$ is defined at [line 232, supplementary]). We will add the definitions.

2. In theory, a matrix completion (MC) algorithm does not require the mean of the input matrix to be removed, since it is expected to 'recover' the matrix whether or not the input matrix is centered. Similarly, in a quantitative study of matrix completion algorithms, [Keshavan et al. 2009] used the raw ratings of MovieLens 100K (note: we used MovieLens 1M) directly as the input to several matrix completion algorithms, without removing the mean value.
- In addition, the input matrix for the compared RPIT algorithm is also not centered, which ensures that the comparison of RPIT and MC is fair in the sense that they have the same input.

[Keshavan et al. 2009]: Keshavan, Raghunandan H., Andrea Montanari, and Sewoong Oh. "Low-rank matrix completion with noisy observations: a quantitative comparison." Allerton 2009.


To Reviewer 2:

- About the citation on line 69: we cited the work of Gross et al., 2010 [8] since their result and proof technique are the most closely related. The recovery result of Recht et al., 2007, on the other hand, requires the observation operator to satisfy the r-RIP (restricted isometry property), which in general does not hold for the observation operators we deal with in our scenario.

- Proposition 1 indeed holds for any pairwise interaction tensor T. In fact, Proposition 1 shows that for any T = Pair(A,B,C), one can transform it into a (unique) ``centered'' version T = Pair(A',B',C'), where A', B', C' satisfy the centering conditions (lines 121-125). The transformation method is given in the proof of Proposition 1 [Section 4, supplementary].

- We will add the reference for coherence (e.g., Candes and Recht, 2009 [5]). Thanks for pointing this out.

- $\delta_i$ is the Kronecker delta. We will add the definition.

- The differences among the computation methods for shrink_A, shrink_B, and shrink_C are inherited from the definitions of S_A, S_B, and S_C, which are slightly different (the definitions are at lines 123-125).
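For reference, all three operators are variants of standard singular value soft-thresholding; writing the SVD as $M = U\,\mathrm{diag}(\sigma_i)\,V^T$, the generic shrinkage operator is

$$ \mathrm{shrink}_\tau(M) \;=\; U\,\mathrm{diag}\big(\max(\sigma_i - \tau,\, 0)\big)\,V^T, $$

while the per-factor differences noted in the response (stemming from S_A, S_B, S_C) are not reproduced here.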

- On the concern ``If several components in A, B and C are highly collinear, the tensor is hard to decompose, and we may not obtain the optimal solution.''
1. Our main results hold whether or not A, B, and C are collinear.
2. In theory, the optimization objectives (Eq. 3 and Eq. 5) are convex, and therefore one can always find their optimal solutions for any input.
3. In practice, we traded optimality for scalability in our recovery algorithm [lines 202-209] and solved the relaxed version (Eq. 7) of the original objective (Eq. 3). Note that the difference between the relaxed objective (Eq. 7) and (Eq. 3) is determined by ||A|| + ||B|| + ||C||, not by the correlations or coherences among A, B, and C. Therefore, the quality of the reconstruction is not affected by the coherences or correlations between A, B, and C. The noisy recovery case is similar.

To Reviewer 3:

(*) Due to space limitations, we excluded a discussion of how a pairwise interaction tensor differs from a general tensor in terms of degrees of freedom. In fact, the number of degrees of freedom of a ``low-rank'' pairwise interaction tensor (i.e., one where A, B, C are low-rank matrices) is comparable to that of a tensor of low tensor rank. We will try to add a brief discussion within the space limitations.

(*) We believe that there are no significant technical gaps in generalizing our theory to higher-order tensors. We did not pursue that direction in this paper in order to avoid complicated notation.

(*) Since the nuclear norm dominates the Frobenius norm, our error bound implies an error bound in terms of the Frobenius norm.
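Explicitly, for any matrix $X$ of rank $r$,

$$ \|X\|_F \;\le\; \|X\|_* \;\le\; \sqrt{r}\,\|X\|_F, $$

so a nuclear-norm error bound immediately yields a Frobenius-norm (and hence RMSE-type) bound.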

(*) We mentioned the storage costs in lines 300-301. Indeed, we only need to store 1) the observed entries of the tensor and 2) three low-rank matrices X, Y, Z.
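A minimal sketch of this memory layout (sizes illustrative; whether the low-rank iterates are kept dense or in factored form is an implementation detail, and the factored form shown here is only one possibility):

```python
import numpy as np

# Sketch of the storage layout described above (sizes illustrative): the
# dense n1 x n2 x n3 tensor is never formed; only the m observed entries
# and the low-rank iterates live in memory.
m, r = 1_000_000, 10
n1, n2, n3 = 1000, 800, 600
obs_ijk = np.zeros((m, 3), dtype=np.int64)   # indices (i, j, k) of observations
obs_val = np.zeros(m)                        # observed values
# Low-rank iterates X, Y, Z, here kept in factored form U @ V.T:
U_x, V_x = np.zeros((n1, r)), np.zeros((n2, r))
U_y, V_y = np.zeros((n2, r)), np.zeros((n3, r))
U_z, V_z = np.zeros((n3, r)), np.zeros((n1, r))
```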

(*) We will include a reference to the MovieLens dataset (which is a public dataset).

(*) This is a typo: it should be the x-axis rather than the y-axis. The y-axis is the rank r. We will adjust the font sizes in the figures.
- In the noisy case, in general, the error cannot be _zero_ even if all entries are observed with noise, since it is impossible to _exactly_ recover the true matrix/tensor (for example, an adversary can always add a small-energy, low-rank perturbation to A, B, C). More specifically, Candes and Plan, 2010 [Sections III.B and IV, 4] discussed a related case in the context of noisy matrix completion. They showed that the optimal bound, achievable with oracle knowledge, is roughly $p^{-1/2}\delta$, where p is the fraction of observed entries and $\delta$ is the noise energy on the observations, in their notation. Therefore, even when all entries are observed (p = 1), the optimal error bound is not zero.
- The RMSE is not normalized; it is just a coincidence that the largest RMSE is around 1.0.

(*) A requirement is missing from the statement of Theorem 2: $\epsilon_2 \ge \epsilon_1$. We indeed require $\epsilon_2 \ge \epsilon_1$ in our argument; otherwise, the true solution (A,B,C) itself would not be a feasible solution to Eq. 5 (if one does not allow any error, the true solution would not match the noisy observations). Our proof of Theorem 2 already uses this condition and the fact that the true solution is feasible (line 756, supplementary: $\|\hat M\| \le \|M\|$). We will add this missing requirement to the final version of the paper. We note that such a requirement is also needed in the noisy matrix completion literature [4,8].
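In symbols (with $\hat T$ denoting the noisy observations and $\mathcal{P}_\Omega$ the sampling operator; this notation is assumed here for illustration), the true solution is feasible precisely because

$$ \|\mathcal{P}_\Omega(T) - \hat T\| \;\le\; \epsilon_1 \;\le\; \epsilon_2, $$

so it satisfies the constraint $\|\mathcal{P}_\Omega(X) - \hat T\| \le \epsilon_2$ of the optimization problem (Eq. 5).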

(*) This is a typo; $26\mu_0 r/n_1n_2$ should be removed.