Paper ID: 4340

Title: Solving Interpretable Kernel Dimensionality Reduction

Listing 14 references in the first sentence, without providing any information about them, is unnecessary. "For supervised dimension reduction, we perform SVM on XW using 10-fold cross validation after learning W": this is a clear problem in the experiments. The generalization of the DR method should be tested on out-of-sample data, i.e. W should be learned on training data only and then evaluated on held-out data. Since the final mapping is a linear projection Z = XW, the comparisons should also include linear subspace learning methods. Comparisons should also include kPCA (or its Nyström approximation) for the data projection.
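The reviewer's point about out-of-sample evaluation can be illustrated with a minimal sketch. Here PCA stands in for the learned projection W (in the paper W would come from the IKDR solver) and a nearest-centroid classifier stands in for the SVM; both substitutions are assumptions for illustration. The key point is that W is learned on the training split only, then applied unchanged to the held-out split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian classes in 10-D (stand-in for a real dataset).
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
X1 = rng.normal(loc=1.5, scale=1.0, size=(100, 10))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Hold out a test split BEFORE learning the projection W.
idx = rng.permutation(len(X))
train, test = idx[:150], idx[150:]

# Stand-in for the learned projection: top-q eigenvectors of the
# training covariance (PCA). In the paper, W would come from ISM.
q = 2
Xtr = X[train] - X[train].mean(axis=0)
cov = Xtr.T @ Xtr / len(train)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs[:, -q:]          # (d, q) projection learned on train only

# Project BOTH splits with the same W, then evaluate a simple
# nearest-centroid classifier on the held-out split only.
Ztr, Zte = X[train] @ W, X[test] @ W
centroids = np.array([Ztr[y[train] == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Zte[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
acc = (pred == y[test]).mean()
print(round(acc, 2))
```

Cross-validating on XW after W was fit to all of X leaks information from the evaluation folds into W; the split-first protocol above avoids that.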

The paper extends the theoretical guarantees of ISM to a family of kernels beyond the Gaussian kernel, and shows that a conic combination of kernels from the ISM family remains a valid ISM kernel.

Pros:
1. ISM can be used as an efficient solution to a variety of IKDR learning paradigms.
2. Experiments on both supervised and unsupervised learning paradigms, and with a variety of kernels, showcase the generality of ISM.
3. The experiments verify the speed improvement offered by ISM as an optimization algorithm for IKDR compared to competing methods, with better or comparable accuracy.

Cons: the novelty seems limited.
1. The method and algorithm seem straightforward.
2. The experiments are somewhat toy-scale.

Summary: [19] recently proposed an efficient iterative spectral (eigendecomposition) method (ISM) for the non-convex interpretable kernel dimensionality reduction (IKDR) objective in the context of alternative clustering, and established theoretical guarantees of ISM for the Gaussian kernel. This paper extends those guarantees to a family of kernels [Definition 1]. Each kernel in the ISM family has an associated surrogate matrix \Phi, and the optimal projection is formed by the most dominant eigenvectors of \Phi [Theorems 1 and 2]. The authors show that any conic combination of ISM kernels is still an ISM kernel [Proposition 1], so ISM can be extended to conic combinations of kernels. They further show that a range of learning paradigms (supervised, unsupervised, and semi-supervised dimension reduction, and alternative clustering) can all be formulated as a constrained optimisation problem (equation 1) which can be solved by ISM. Hence the realisation of ISM's potential through its generalisation to a wider range of kernels impacts a wide range of applications. The performance of ISM with a range of kernels from the ISM family is illustrated in 5 real-data experiments under different learning paradigms.

Quality: The theoretical claims are well supported by theoretical analysis. A variety of experimental results are presented, showcasing the application of the ISM algorithm to a variety of learning paradigms using a variety of kernels.

Clarity: Overall, the paper has a clear structure. However, a few places confuse me.
- In the paragraph starting at line 153, the authors introduce the matrices used with language from group theory, I believe? Can the authors elaborate on the reasons and the connections (if any)?
- Can the authors provide some intuition for the ISM kernels and the associated \Phi matrix? All the proofs seem to fall magically into place using this \Phi matrix, so in some sense this \Phi matrix captures all the information about the corresponding kernel matrix? Is this correspondence unique?
- The matrix A is first defined on p. 16 in the proof of Lemma 1, yet is first mentioned in Theorem 3.
- Regarding the notation in Definition 1 and later in the proof of Lemma 1, I am a little confused by what a(x_i, x_j) is. A function with two arguments? Can it be any function? (I notice the authors discuss the cases a = x_i and a = x_i - x_j, but when do we use which? Some derivations of the \Phi matrix use the former while others use the latter. A clear explanation is needed here.) Also, what does {\bf a} = a(x_i, x_j) mean?
- In Table 4, in the MNIST row with NMI, why does the metric suddenly change to a percentage, and why are both entries bolded? Also, for the wine example, why are the NMI values 0.86 and 0.84 both bolded?

Originality: The main contribution of the work is the extension of the theoretical guarantees of ISM [19] to a family of kernels and to any conic combination of kernels from this family. By showing that the objective functions of a variety of IKDR problems can be formulated in the same form, the paper illustrates the wider applicability of the ISM algorithm. Though it establishes some new/improved results on existing datasets, I feel the main contributions of the paper are theoretical.

Significance: The realisation of ISM's potential through its generalisation to a wider range of kernels impacts a wide range of applications. The experiments show that ISM can now be applied to a variety of IKDR problems with better computational efficiency than other state-of-the-art manifold optimisations. Comparable and sometimes better performance is observed when using kernels other than the Gaussian. Practitioners are likely to use these ideas.

=======================
Authors response read.
I am happy with the response provided.
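The iterative scheme the summary describes (build the surrogate matrix \Phi for the current projection, replace the projection with \Phi's q most dominant eigenvectors, repeat) can be sketched schematically as follows. The `gaussian_phi` surrogate below is a hypothetical kernel-weighted scatter of pairwise differences, for illustration only, and is not the paper's exact \Phi; only the fixed-point structure of the loop is the point:

```python
import numpy as np

def ism_fixed_point(X, q, phi_matrix, iters=50, tol=1e-8):
    """Schematic ISM loop: rebuild the surrogate matrix Phi for the
    current projection W, then replace W with Phi's q most dominant
    eigenvectors, until the projected subspace stops changing."""
    d = X.shape[1]
    W = np.eye(d)[:, :q]                     # initial orthonormal W
    for _ in range(iters):
        vals, vecs = np.linalg.eigh(phi_matrix(X, W))
        W_new = vecs[:, -q:]                 # most dominant eigenvectors
        # Compare subspaces via their projection matrices.
        if np.linalg.norm(W_new @ W_new.T - W @ W.T) < tol:
            return W_new
        W = W_new
    return W

def gaussian_phi(X, W, sigma=1.0):
    """Hypothetical Gaussian-kernel surrogate (illustration only):
    kernel-weighted scatter of pairwise differences (x_i - x_j)."""
    Z = X @ W                                # project with current W
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / (2.0 * sigma ** 2))     # Gaussian kernel on Z
    diffs = X[:, None, :] - X[None, :, :]
    # Weighted sum of outer products (x_i - x_j)(x_i - x_j)^T.
    return -np.einsum('ij,ijd,ije->de', K, diffs, diffs)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
W = ism_fixed_point(X, 2, gaussian_phi)
print(W.shape)                               # learned 5 x 2 projection
```

Because each iterate is a set of eigenvectors of a symmetric matrix, W stays orthonormal by construction, which is the appeal of the spectral update over generic manifold optimisation.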