Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
I think this paper is well prepared. I have the following comments. (1) I think the most important contribution is in theoretical aspect. Nevertheless, I cannot judge the proposed results is based on the previous works or these results in developed by the authors themselves. Thus, I suggest the authors to highlight their unique contributions. (2) I notice one assumption is that k clusters have equal size. I think this is too strict for real application. Can the authors give more insights about the results when this assumption does not hold? (3) As stated in Section 5, the authors have also touched the computational issue. I have not found some numerical comparison. I suggest the authors to added some numerical comparisons.
Originality: The paper is an extension of  to apply the Power Mean Laplacian to semi-supervised learning. The paper provides some insights about the new algorithm for the choice of p, depending on the noise assumption on the graphs. Quality: The theorems in the paper is insightful for understanding the algorithm. The provided numerical analysis is also convincing to demonstrate the theoretical results. The proposed method for solving the linear system efficiently also makes the algorithm more practical. However, I’m not sure if the empirical results were convincing enough as the authors claimed they didn’t tune to the parameters for all algorithms. Some of the baselines gives pretty terrible results, e.g. SMACD, which contradicts with the original paper. Clarity: The paper was written clearly. I can follow the paper pretty easily. There are a few typos I found in the paper: Page 2, line 71, “Moreover, not that” should be “note that” Page 5, line 183, “its overal performance is larger than the proposed approach”. overal -> overall, performance -> error? Significance: The proposed algorithm seems an interesting direction for semi-supervise learning. Even though the paper provides some efficient method for solving the linear system, the complexity of the algorithm was still pretty high, which might hinder the wide adoption of the algorithm.  A. Subramanya and P. P. Talukdar. Graph-Based Semi-Supervised Learning. Morgan & Claypool Publishers, 2014.
The paper discusses how to solve semi-supervised learning with multi-layer graphs. For single-layer graphs, this is achieved by label regression regularized by Laplacian matrix. For multi-layer, the paper argues that it should use a power mean Laplacian instead of the plain additive sum of Laplacians in each layer. This generalizes prior work including using the harmonic means. Some theoretical discussions follow under the assumptions from Multilayer Stochastic Block Model (MSBM), showing that specificity and robustness trade-offs can be achieved by adjusting the power parameter. I am not fully understanding or convinced by the Theorem 1. Particularly, it claims zero test error under stochastic models, which is either wrong or suggesting that the method may be impractical. Further, Corollary 1 and most of the experimental results support a negative norm parameter, which is rarely seen. I also checked the additional descriptions on Krylov subspace methods. Krylov subspace methods are important for sparse matrix factorization. However, I found no original contributions on this part from the authors, besides a (nice) simple mentioning. === Post rebuttal === I re-checked the paper. I agree that the paper seems intuitive and it makes incremental contributions. I will adjust my score to weak accept, with the following notes: The main contribution is a SSL method that is "provably robust against noise ... as long as one layer is informative and remaining layers are potentially just noise" as a limiting case. The intuition is that when p -> -inf, the smallest eigenvalues of the informative layers will dominate the other presumably larger eigenvalues of the noisy layers. The intuition is further extended to a series of consistency theorems using graph Laplacians in expectation, which is a limiting case given infinite amount of edge observations. While the intuition for p -> -infty is obvious, I would love to see more discussions around p = -1 case. The p=-1 case is included in Theorem 1, but not discussed in Corollary 1. Particularly, the assumption that "one layer is informative and the remaining layers are potentially just noise" no longer holds. Neither is p=-1 discussed in Section 4.2, because Figure 4 shows equally good results with p from -10 to 1, i.e., not specific to p=-1. The main evidence for p=-1 case is from the real-data experiments, but with insufficient discussions. The Krylov subspace methods need to be presented more clearly. Does it fundamentally change the orders of complexity or not?