Paper ID: | 4229 |
---|---|
Title: | Stein Variational Gradient Descent With Matrix-Valued Kernels |

The paper is clearly written: both precise and easy to read. The main idea is very simple: remove the assumption that the space of vector-valued functions H^d is equipped with the standard inner product (or, equivalently, the standard kernel), i.e., take into account that the vector components could be "correlated". While this is simple, it does add clarity to the theoretical setting of SVGD, in which it is assumed, but not stated, that the components are independent. As the authors show, it is important to note that H^d is a vector-valued RKHS, since coordinate transformations (which should not affect experiments) induce non-standard matrix kernels. While this provides the natural theoretical framework, it raises the question of which matrix kernel to choose, and I think the authors do not really answer this important (but probably hard) question. Indeed, it seems to me that the natural way to choose the matrix kernel is to use some intrinsic (coordinate-independent) geometric information, but the authors themselves explain that this leads to expensive computation. It is not clear to me why the "mixture preconditioning kernel" would work better in general.
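
To make the remark about coordinate transformations concrete, here is a short sketch of the standard change-of-variables argument. The notation (t, K_0, k_0, Q) is my own assumption and is not taken from the manuscript. Let y = t(x) be a smooth invertible change of coordinates with Jacobian \nabla t. A velocity field \phi_0 in the y-coordinates corresponds to the field \phi in the x-coordinates, and if \phi_0 ranges over a vector-valued RKHS with kernel K_0, the induced kernel on the x side is generally not a scalar multiple of the identity:

```latex
\phi(x) = \nabla t(x)^{-1}\, \phi_0\big(t(x)\big),
\qquad
K(x, x') = \nabla t(x)^{-1}\, K_0\big(t(x), t(x')\big)\, \nabla t(x')^{-\top}.

% Special case: K_0 = k_0 I and a linear map t(x) = Q^{1/2} x with Q symmetric positive definite
K(x, x') = Q^{-1}\, k_0\big(Q^{1/2} x,\; Q^{1/2} x'\big).
```

So even when the kernel is "standard" (k_0 I) in one coordinate system, a linear change of coordinates already produces a matrix-valued kernel with a constant preconditioner Q^{-1}.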

*** UPDATE *** Thank you for your response. This has clarified some of my comments, and thus I have increased my score by one level. However, a negative aspect of the author response was the excuse given for the absence of a comparison against SVN: "the results we obtained were much worse than our methods and baselines, and hence did not investigate it further". To me this seems like a very obvious thing to include and discuss in the paper, since it would lend strong support to the new method, so I am a bit suspicious about why it wasn't included. I hope the authors will include it in the revised manuscript, if accepted. Another reviewer commented on the lack of a wall-time comparison, and again I feel that the authors did not give a valid excuse for omitting this information from the manuscript (the reader can, I think, be trusted to adjust for different computational setups and hardware when interpreting the wall-time data).

This paper generalises the Stein variational gradient descent method to the case of a matrix-valued kernel. Various matrix-valued kernels are proposed in this context and they are empirically studied. The idea is novel and the method is interesting enough. The main criticisms I have about this work are the limited nature of the empirical assessment and the incomplete description of that assessment. In particular, what I would consider to be some key empirical comparisons haven't been included, and key implementational details needed to interpret the results aren't provided. Code is provided, but the reader should not be expected to reverse-engineer the experiments from the script. In this respect, a minimal standard for reproducible scientific writing has not been met.

l14. The authors claim that distributional approximation is somehow more difficult than optimisation. This discussion seems fluffy and not well-defined, since of course SVGD is also an optimisation method. I would just remove l14-16.

l39. "extends" -> "extensions"

l55. The authors should remove the word "complex".

l61. "unite" -> "unit"

l61. "The" unit ball is not unique: it depends on the choice of norm on R^d. The norm on R^d being used should be stated. The lack of precision on this point means it is also unclear how ||.||_{H_k^d} is being defined.

l120. The authors mention that "vanilla SVGD" is a special case of their method. Is R-SVGD a special case of their method? If not, this should be acknowledged.

l130. I may be wrong, but should the "+" be a "-"? Otherwise it looks like gradient ascent instead of descent.

l152. The authors could note that the use of constant preconditioning matrices in the context of kernels and Stein's method was also considered in [CBBGGMO2019].

l207. A reference is needed for the median trick.

l210. The authors write that in the experiments Q was "either" the average Hessian or the Fisher information matrix of the particles. This is not good enough: the authors need to be precise about what was used for which experiment.

l211. For each experiment, the number of particles / anchor points used to form the mixture preconditioning matrix kernel in Matrix-SVGD is not stated. It is also not stated how the anchor points were selected. The effect of these factors on the results was not explored. As such, I cannot properly interpret any of the empirical results for Matrix-SVGD.

l216. The authors need to state what kernel was used to compute the MMD reported in Fig. 1. (This should not be the preconditioner kernel used for Matrix-SVGD, as this would give the proposed method(s) an unfair advantage.)

l216. It appears that competing methods may not be initialised from the same initial point set. Indeed, in Fig. 1(f) very different values of MMD are reported when the number of iterations is small. To avoid doubt, the authors should state precisely what initial configuration(s) of point sets were used.

l223. Why was SVN (probably the most closely related existing method to the Matrix-SVGD method being proposed) not included in any of the three "real" empirical experiments (4.2, 4.3, 4.4)?

l240. The authors do not state whether mini-batching was used in 4.3 and 4.4. If it was used, this should be stated, and the corresponding implementational details provided.

l240. Neither the number of iterations nor the learning rate is provided for 4.3 or 4.4. This is basic and important information, and I can't understand why it was not included anywhere in the manuscript or supplement.

lA339. The interchange of expectation and inner product should be justified. What assumption(s) are required?

lA340. The authors just "cite a book" in the proof of Lemma 2, whereas a pointer to the specific result in the book (and a check of any preconditions) is needed.

[CBBGGMO2019] Chen WY, Barp A, Briol FX, Gorham J, Girolami M, Mackey L, Oates CJ. Stein Point Markov Chain Monte Carlo. International Conference on Machine Learning (ICML 2019).
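
To illustrate why the comments on l207 and l216 matter, the sketch below shows the kind of detail that would make a reported MMD reproducible: a fixed evaluation kernel, an explicit bandwidth rule, and a fixed seed. It is not the authors' evaluation code. The function names, the Gaussian kernel exp(-||a-b||^2 / h), the choice of h as the median of pairwise squared distances (one common variant of the median trick), and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between the rows of a and b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def median_bandwidth(z):
    """Median heuristic (assumed variant): median of off-diagonal squared distances."""
    d2 = sq_dists(z, z)
    upper = np.triu_indices(len(z), k=1)
    return np.median(d2[upper])

def mmd2_biased(x, y, h):
    """Biased (V-statistic) estimate of squared MMD with kernel exp(-||a-b||^2 / h)."""
    kxx = np.exp(-sq_dists(x, x) / h)
    kyy = np.exp(-sq_dists(y, y) / h)
    kxy = np.exp(-sq_dists(x, y) / h)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

# Toy data: a reference sample from the target and a shifted particle set.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 2))
particles = rng.normal(loc=0.5, size=(100, 2))

# The bandwidth is fixed from the reference sample, independently of any method's kernel.
h = median_bandwidth(reference)
mmd_shifted = mmd2_biased(particles, reference, h)
mmd_same = mmd2_biased(reference, reference, h)
```

Because the value of the MMD changes with the kernel and bandwidth, fixing both independently of any method under comparison (and reporting them) is what allows curves such as those in Fig. 1 to be compared fairly.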

The paper provides a nice generalization of SVGD. It does this by optimizing the KSD over a different unit ball of functions: instead of using the direct sum of d copies of an RKHS with a scalar-valued kernel, it considers the unit ball of vector-valued functions that arises from an RKHS associated with a matrix-valued kernel. This allows one to better capture local information in the density functions, as evidenced in the experiments. The authors also point out how many other variants of SVGD arise as special cases of this one.

L102: There looks to be a typo. Do you mean to say the vv-RKHS is equivalent to the direct sum of d copies of H_k?

Experiments 1 & 2: Is there any way to compare the algorithms on CPU time rather than iterations? The matrix-valued SVGD will take longer per iteration, and it would be nice to see how these approaches compare as a function of time.

Originality: The paper produces a generalization of SVGD that captures many other already known variants. It also provides a computationally feasible version that is useful in the experiments shown.

Quality: The paper appears to be correct.

Clarity: The paper is very well written and easy to follow. It was a nice read.

Significance: The paper is a nice contribution to the topic of SVGD and provides some ideas for incorporating more information about the score function.
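
The special-case claim can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation. It assumes a Gaussian RBF kernel, a constant-preconditioned matrix kernel of the form K(x, x') = Q^{-1} exp(-(x - x')^T Q (x - x') / (2h)), and a toy Gaussian target; all function and variable names are hypothetical. It verifies that Q = I recovers the vanilla SVGD direction, and that a general Q gives the same direction as vanilla SVGD run in the coordinates y = Q^{1/2} x and mapped back.

```python
import numpy as np

def svgd_direction(x, score, h):
    """Vanilla SVGD direction with scalar kernel exp(-||x_i - x_j||^2 / (2h))."""
    diff = x[:, None, :] - x[None, :, :]              # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
    drive = k @ score                                  # sum_j k_ij * score_j
    repulse = np.sum(k[:, :, None] * diff, axis=1) / h # sum_j grad_{x_j} k_ij
    return (drive + repulse) / len(x)

def matrix_svgd_direction(x, score, h, Q):
    """Direction for the assumed matrix kernel K = Q^{-1} exp(-(x-x')^T Q (x-x') / (2h))."""
    Q_inv = np.linalg.inv(Q)
    diff = x[:, None, :] - x[None, :, :]
    maha = np.einsum('ija,ab,ijb->ij', diff, Q, diff)  # squared Mahalanobis distances
    k = np.exp(-maha / (2.0 * h))
    drive = (k @ score) @ Q_inv.T                      # sum_j k_ij * Q^{-1} score_j
    repulse = np.sum(k[:, :, None] * diff, axis=1) / h # Q^{-1} cancels the Q in grad k
    return (drive + repulse) / len(x)

# Toy target N(0, precision^{-1}); its score is -precision @ x.
rng = np.random.default_rng(0)
precision = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.normal(size=(50, 2))
score = -(x @ precision)
h = 1.0

# Check 1: the identity preconditioner recovers vanilla SVGD.
phi_vanilla = svgd_direction(x, score, h)
phi_identity = matrix_svgd_direction(x, score, h, np.eye(2))

# Check 2: Q = precision matches vanilla SVGD in coordinates y = Q^{1/2} x.
w, v = np.linalg.eigh(precision)
Q_half = (v * np.sqrt(w)) @ v.T
Q_half_inv = (v / np.sqrt(w)) @ v.T
y = x @ Q_half
score_y = score @ Q_half_inv                           # chain rule for the score
phi_from_transformed = svgd_direction(y, score_y, h) @ Q_half_inv
phi_matrix = matrix_svgd_direction(x, score, h, precision)
```

Here `np.allclose(phi_identity, phi_vanilla)` and `np.allclose(phi_matrix, phi_from_transformed)` should both hold. The second check also hints at where the extra per-iteration cost comes from: the preconditioned variant needs Q^{-1} and Mahalanobis distances, which is why a comparison in CPU time would be informative.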