NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2580
Title: Kernel Instrumental Variable Regression

Reviewer 1


Originality: The paper introduces a kernel variant of IV regression that combines 2SLS with RKHSs, which is novel, and provides an extensive review of the literature with connections to many fields.

Quality: The paper is of high quality, providing theoretical proofs and extensive experimental results. A more thorough discussion of the limitations of the method would be useful to understand the extent of its applicability.

Clarity: The paper is very clearly written and well organized, a pleasure to read. The assumptions are clearly stated.

Significance: Extending IV regression to nonlinear settings is a relevant research topic, as many real-world applications involve nonlinear interactions. Thanks to the theoretical proofs provided and the efficient ratio for training samples, this work could be used in applied fields in a fairly straightforward way.

Comments/questions:
• The connection to kernel embeddings of distributions in stage 1 is very nice, and there are two papers that I find could be referenced: the initial paper on kernel embeddings of distributions, A. Smola et al., 2007, "A Hilbert space embedding for distributions", and K. Muandet et al., 2017, "Kernel mean embedding of distributions: a review and beyond", which is a nice review.
• Line 16: reference to IV regression.
• How does KernelIV perform if the data indeed comes from a linear setting, where h(X) is a linear function of X? Does it still outperform the other methods?
• How does KIV perform if the efficient ratio n/m (line 300) cannot be satisfied?
• What kernels did the authors use in the experiments for K_XX and K_ZZ (Alg. 1), and how did you choose the parameters of the kernels? Is there a robustness study w.r.t. these parameters? (An illustrative sketch of one common choice is given at the end of this review.)
• A more detailed description of the parameters used in the experiments for the different methods would have been useful, but this was probably a problem of space.
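To make the kernel question above concrete: a common default choice (which I am not claiming is what the authors used) is a Gaussian kernel with a median-heuristic bandwidth. A minimal sketch of such a setup, with hypothetical data and variable names, and of how a bandwidth robustness study could be organised, might look like this:

```python
import numpy as np
from scipy.spatial.distance import cdist

def median_heuristic_bandwidth(A):
    """Median of the positive pairwise Euclidean distances (a common RBF default)."""
    d = cdist(A, A)
    return np.median(d[d > 0])

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    d2 = cdist(A, B, metric="sqeuclidean")
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical data: X is the endogenous regressor, Z the instrument.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Z = rng.normal(size=(200, 1))

sigma_x = median_heuristic_bandwidth(X)
sigma_z = median_heuristic_bandwidth(Z)
K_XX = rbf_kernel(X, X, sigma_x)
K_ZZ = rbf_kernel(Z, Z, sigma_z)

# A simple robustness study: rescale the median bandwidth over a grid and
# record how the downstream test error of the fitted estimator changes.
for scale in (0.25, 0.5, 1.0, 2.0, 4.0):
    K_XX_scaled = rbf_kernel(X, X, scale * sigma_x)
    # ... refit the estimator with K_XX_scaled and compare test error ...
```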

Reviewer 2


Originality: The proposed algorithm, KIV, is novel, to my knowledge and according to the well-detailed bibliography. I would have said that every linear model had already been kernelized, but it seems that was wrong. The authors provide a clear comparison of their work with existing approaches, highlighting the benefits of KIV while recognizing the merits of previous works, which is always more enjoyable.

Quality: As the paper presents a technique not so well known to machine learners, the authors take time to present all the required background. Apart from the original idea of kernelizing instrumental variable regression, the paper contains a thorough analysis of its properties, an algorithm, and its practical application, which makes the paper very complete. The authors suggest that their analysis is somewhat restricted (l. 227), in the sense that one assumption might be too restrictive. All claims are supported either by precise references or by included theorems. Concerning the soundness of the maths, I have to admit that some parts are too complicated for me; I might have missed something, in particular in the proofs. However, to the best of my knowledge, the consistency analysis makes sense. It is not an easy piece to read (many notations and very condensed content; the supplementary material is helpful), but I'm quite confident that a reader more familiar with the techniques used could follow.

Clarity: This paper is clearly one of the top papers I have had to review (in general) in terms of clarity, regarding its technical content and the amount of information. The introduction is very accessible even though I had no clue about IV until this review, and the related works are not only mentioned but also quite clearly described (which helps a lot in pointing out the advantages of the proposed method). Part 2 begins to be more technical; efforts are made to give a synthetic view of the paper (thank you for Figure 1). Part 3 details the learning problem and is, for my "reading profile", the most useful part of the paper. Casting the IV problem in the RKHS setting is elegant and brings several major advantages (such as the fact that some properties hold by construction, l. 183); an illustrative sketch of the two-stage idea is given after this review. Part 4 contains the theorems of the paper (at least a very synthetic view of them). It is very structured, presenting the assumptions and hypotheses clearly, and also gives intuitions about what they say, permitting the non-expert reader to follow along. The experimental part is maybe the "weak" part of the paper, considering the space constraints. It seems to me that the supplementary material contains more explicit results than this part (Figures 2 and 3 are unreadable if printed).

Significance: I think that KIV can be very significant for practitioners, since it opens the door to nonlinear models. In the ML field, it might not be as significant (unless there is something in the techniques used to prove the theorems that I missed), but it is a nice piece of work. However, it is linked to the very challenging problem of causality, and on this aspect it could inspire new ideas.

Details (in order of appearance in the document):
- l. 73/74: the reason for this notion of training sample ratio is not clear at this stage.
- l. 107: unless I missed it, 'e' is not really defined, although it is not too hard to understand. Given the amount of information, a definition would be helpful.
- l. 120: I did not understand what LHS and RHS refer to (I guess this sentence is here to make the link between the two worlds, of course, but I'm not sure it is the right place).
- l. 133: 'stage' at the end of the line and '1' at the beginning of the next line; use ~.
- l. 271: a.s.?
- Figures 2 and 3: what is 's' in the legend (1000s)?
- Bibliography: I'm annoyed that at least 7 references are not cited in the main document (only in the supplementary material). I'm also annoyed that some groups of papers are cited for the same point, when all the papers are clearly by the same authors (e.g. refs. 24, 25 and 26, or 32, 33, 34, or 22, 28, or 39, 40, or 46, 47). I don't know if this is good practice or not, but it makes for a very long bibliography, and I don't really know how to choose which one to read for a specific point.
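To illustrate for other non-specialist readers the two-stage RKHS idea praised above: the following is a minimal sketch of a two-stage kernel ridge procedure in the spirit of the paper's Algorithm 1, not a verified reproduction of it; the data-generating process, bandwidths and regularization constants are all hypothetical.

```python
import numpy as np

def rbf(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical confounded data: e is an unobserved confounder,
# Z is an instrument that affects X but not Y directly.
rng = np.random.default_rng(0)
n, m = 500, 500                          # stage-1 / stage-2 sample sizes
e = rng.normal(size=n + m)
Z = rng.uniform(-3.0, 3.0, size=n + m)
X = Z + e + 0.1 * rng.normal(size=n + m)
Y = np.abs(X) + e                        # true structural function h(x) = |x|

X1, Z1 = X[:n, None], Z[:n, None]        # stage-1 sample (x_i, z_i)
Y2, Z2 = Y[n:], Z[n:, None]              # stage-2 sample (y_j, z_j)

lam, xi = 1e-3, 1e-3                     # ad hoc regularization constants
K_XX = rbf(X1, X1, 1.0)
K_ZZ = rbf(Z1, Z1, 1.0)
K_ZZ2 = rbf(Z1, Z2, 1.0)

# Stage 1: kernel ridge regression from Z to the feature map of X, giving
# predicted conditional mean embeddings at the stage-2 instruments.
W = K_XX @ np.linalg.solve(K_ZZ + n * lam * np.eye(n), K_ZZ2)   # (n, m)

# Stage 2: ridge regression of Y on the predicted embeddings.
alpha = np.linalg.solve(W @ W.T + m * xi * K_XX, W @ Y2)        # (n,)

# Evaluate the fitted structural function on a test grid.
x_test = np.linspace(-3.0, 3.0, 100)[:, None]
h_hat = rbf(X1, x_test, 1.0).T @ alpha
```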

Reviewer 3


== Finite dimensional output RKHS ==

As mentioned, it seems that the authors implicitly assume that the output RKHS H_X in the first stage (conditional mean embedding) is finite dimensional. I explain this below.

- In the proof of Theorem 6 in the Appendix, the authors assume that the covariance operator T_1 admits an eigendecomposition in terms of an orthonormal basis. However, unless T_1 is shown to be a compact operator, such a decomposition may not exist; see e.g. Thm A.5.13 of Steinwart and Scovel's book. The problem is that, given that the operator-valued kernel is defined as in lines 165-166 (which results in Algorithm 1), the covariance operator is not of trace class (and thus is not compact) if the RKHS H_X is infinite dimensional. This can be seen from the proof of Proposition 21, which is given in [17, Proposition 13], where the output RKHS is essentially assumed to be finite dimensional (this can be seen from [17, Eq. 71], where the middle term becomes infinite if the RKHS is infinite dimensional and the operator-valued kernel is defined as in lines 165-166). Therefore, the results in the current paper do not hold unless H_X is finite dimensional. This should be clearly stated and discussed.
- In Hypothesis 5, a power of the covariance operator is introduced, but this may not be well-defined. This is because, to define the power, one needs an eigendecomposition of the covariance operator, and this may not be available if H_X is infinite dimensional, as mentioned above. In any case, the authors should explicitly state the definition of the power operator somewhere in the paper or in the appendix (the standard spectral definition is sketched below).
- From what I remember, Grunewalder et al. [25], who borrowed results from Caponnetto and De Vito [11], essentially assume that the output RKHS for the conditional mean embedding is finite dimensional, for the same reason I described above. This is why I suspected that the current paper also implicitly assumes the output RKHS is finite dimensional.

Other comments:
- In Theorem 2, the optimal decay schedule of the regularization constant should be explicitly stated.
- Where is the notation \Omega_{\mu(z)} in Definition 2 and Hypothesis 7 defined?
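For concreteness, the spectral definition of the operator power that the second point above asks the authors to state is, in the compact case, the following; this is a sketch under the assumption that T_1 is compact, positive and self-adjoint, which is exactly the property at issue:

```latex
% Spectral theorem for a compact, positive, self-adjoint operator T_1 on H_X:
T_1 f \;=\; \sum_{i \ge 1} \lambda_i \, \langle f, e_i \rangle_{\mathcal{H}_X} \, e_i,
\qquad \lambda_i \ge 0,\ (e_i)_{i \ge 1} \text{ orthonormal eigenvectors}.
% The power is then defined spectrally, for any \beta > 0, by
T_1^{\beta} f \;=\; \sum_{i \ge 1} \lambda_i^{\beta} \, \langle f, e_i \rangle_{\mathcal{H}_X} \, e_i.
% Without compactness this series representation, and hence this definition
% of T_1^{\beta}, need not be available.
```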