Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
==================================== After rebuttal I thank the authors for their reply, they have managed to clarify some of my concerns and overall I vote for acceptance of the paper. ============= This paper introduces a boosting method for transfer learning with instance re-weighting in the setting where labeled data are available in both training and target tasks. Theorem 1 provides a bound for the population error on the target task, and motivates four instance re-weighting ``principles''. A practical procedure is introduced, which achieves competitive results on two standard datasets for transfer learning. Novelty: To my knowledge, the theoretical analysis carried out by the authors in the context of fully labeled data is novel. In particular, I am not aware of previous bounds including the performance gap from Def 1. I did not, however, go through the proofs in the Appendix in detail. Significance: I find the results interesting, and found the bound in 2) insightful, as well as the resulting algorithm. The empirical results are also strong, and present an advantage over competing methods (though small for TransferBoost). Clarity: The paper is overall well structured and the mains ideas are easily conveyed. However, I find that improving the presentation of gapBoost in Section 3.3 would benefit the paper. The algorithm is rather obscure at first glance, and the actual procedure is not explained in the main text (only a very high level view). Key elements, such as \rho_S and \rho_T, are not really defined, and lead to confusion on a first read since the notation overlaps with the Lipschitz constant. The procedure for re-weighting the new weights at each iteration based on the performance of the different learners is not explained in the text. Moreover, it is hard to understand how the four principles should be interpreted in Section 3.1, since they contradict each other (as the authors point out), and what the authors want to achieve is only made clear after reading the whole Section. I believe the trade-off could be made clearer, perhaps after the results from 3.2, and that the main algorithm could be described in more detail. In addition to the previous comments, I have the following concerns. 1) What can we expect when linearity does not hold? Linearity is assumed for the Theorem (line 115), and in the experiments (line 252). However, the method is presented as general, see for instance line 207. Have the authors experimented with this, and what can we expect from the theory? 2) An interesting aspect of the algorithm is the trade-off between different values of \rho_S and \rho_T. Is there a principled way to select them, and would it be possible to have them vary together with the instance weights? Why is \rho_T=0 in the experiments? It seems that at least a simple heuristic based on the number of examples available in each task could be derived. Finally, are \rho_S and \rho_T fixed as the ratio of training examples increases in Fig 1? 3) Often, we do have unlabeled data in the test set. How could we make use of these data in the proposed theory/algorithm? As a minor comment, why is the notation for the weights changed from \Gamma to D in the algorithm?
1. Originality. The authors propose a novel measure to evaluate domain discrepancy; the idea is interesting and novel. 2. Quality. The manuscript is well-written. 3. Clarity. The manuscript provides methods and detail analysis. However, as the experimental settings are semi-supervised based, and the baseline methods are a little out-of-date. I suggest the authors add more baselines to demonstrate the effectiveness of the proposed method. An example is listed as follows. Zhang L, Zuo W, Zhang D. LSDT: Latent sparse domain transfer learning for visual adaptation[J]. IEEE Transactions on Image Processing, 2016, 25(3): 1177-1191. ================ After rebuttal Although the baseline methods are all boosting-based, I think the performance is an important factor for a good manuscript. ================ 4. Significance. The quality is good, and the technical significance is high.
I would consider Theorem 1 and analysis behind it as a main contribution of the paper. Even though, the bound has too many "moving parts", the trade-off between relevant quantities seems to make sense. One caveat is L2 norm of the weights: when inequality ||Γ||_2 <= sqrt(N) ||Γ||_∞ is tight (for sufficiently large N), the bound of Theorem 1 appears to be vacuous (it is multiplied by the term of order sqrt(N)). Line 182: "...||Γ||_∞ << 1/N_T, which implies that transfer learning has a faster convergence rate than single-task learning" -- this is not entirely true, as the bound is also controlled by the discrepancy, which can be large enough for any source sample size. == post-rebuttal reply I agree with the authors (and raise the score): the rate seems to be fine (the crucial point here is that ||Γ||_2 is squared). One additional note: the bound seems to hold only for a fixed Γ. It should be okay for a general intuition, but to for the design of an algorithm which minimizes the bound simultaneously in h and Γ simultaneously, we should have a bound uniform in Γ (or at least a union bound). Authors could discuss this in a final version.