
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The present paper focuses on kernel-based multi-task learning. More specifically, it deals with learning the output kernel defined over multiple tasks. Starting from the vector-valued RKHS formulation of the multi-task learning problem, the authors propose an output kernel learning algorithm that is able to learn both the multi-task prediction function and the task dependencies. The authors first provide an optimization formulation of the problem which is jointly convex. They then show that this optimization problem can be solved without the positive-semidefiniteness constraint
on the output kernel, which is computationally costly, and propose the use of a stochastic dual coordinate ascent method to solve it.
Experiments on multi-task data sets, with comparisons in terms of performance and running time against previous multi-task learning algorithms, are provided.
The paper builds upon recent studies on multi-task learning and output kernel learning. In particular, it is closely related to: [1] Dinuzzo et al., Learning output kernels with block coordinate descent, ICML 2011. [2] Zhang & Yeung, A convex formulation for learning task relationships in multi-task learning, UAI 2010.
While these two references are clearly mentioned in the paper, there are other closely related papers which are not discussed and cited. For example [3] Kadri et al., Multiple operator-valued kernel learning, NIPS 2012. [4] Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality, UAI 2013. [5] Ciliberto et al., Convex learning of multiple tasks and their structure, ICML 2015.
These papers, in particular [4,5], need to be cited and discussed. The proposed approach needs to be compared to them from both theoretical and experimental perspectives.
The contribution is somewhat limited. The main novelty is the dual formulation of the optimization problem, which does not need the positive-semidefiniteness constraint on the output kernel. However, it should be noted that even in the previous works [1] and [2], where an alternating optimization scheme is adopted, the positive-semidefiniteness constraint is not needed. Incorporating the solution of the minimization over the multi-task learning function into the optimization problem for output kernel learning (without the positive-semidefiniteness constraint) yields a positive semidefinite matrix over tasks. This is clearly mentioned in [1] (see Eq. 7) and [2]. The way this paper is written gives the impression that it is the first to avoid the optimization over $S_+^T$.
Also, the result about the joint convexity of the output kernel learning optimization problem is not new. This is largely discussed in [5] and also in [2]. It is true that [5] is a newly published paper, but the contribution of the present work
would be more visible if recent results in this field were taken into account.
In the experiments, it is not clear why the proposed approach is compared to OKL in terms of running time but not in terms of performance. The paper is about output kernel learning, so the performance of the proposed method needs to be compared to that of [1, 4, 5].
Update: The authors did a good job of answering my questions. My main concern is about the presentation of the results and their novelty. I think there is technical novelty and originality in the proposed approach; however, the authors should explain this better and make clear what the novel contributions of the paper are in comparison to what is known.
Q2: Please summarize your review in 1-2 sentences
The paper provides an output kernel learning algorithm that avoids the positive-semidefiniteness constraint on the output kernel. The paper provides some interesting results and a new formulation of the output kernel learning problem. The writing can be improved. The contribution is somewhat limited. Some closely related references are missing.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This is a well-written paper that introduces a novel method for multi-task learning. The mathematical derivation of the resulting optimization problem and of the algorithm used to estimate the parameters is well explained. Its usefulness is supported by several experiments on multi-task learning and multi-class classification problems.
However, I have one major concern w.r.t. the introduced method. Despite the good performance in the experiments, I am not really convinced by the formulation in Eqns. (2) and (3). Those formulations do not allow for separate regularization mechanisms for tasks and for the parameter vectors of a single task, respectively. More specifically, it is known in the field that these two regularization mechanisms have different effects in multi-task learning. Adding a regularizer for the parameters of a single task prevents overfitting, as in traditional classification or regression settings.
However, including a second regularization mechanism that shrinks models so that related tasks have similar models is known to improve predictive performance (see several references in the reference list). Many existing methods for multi-task learning adopt this principle, in which two different regularization parameters are tuned in a validation step. The authors' method does not have this flexibility, so theoretically I would expect it to be inferior in predictive performance to existing methods. This issue is not discussed at all in the paper, and I think it is an important point. It should be discussed why formulation (2) is a particularly good approach for tackling multi-task learning problems.
Q2: Please summarize your review in 1-2 sentences
This is an interesting paper, but I have one major concern about the proposed methodology (see below).
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The main contribution lies in Section 3, where an efficient optimization procedure is presented to solve a convex subproblem on a PSD cone. Some of the notation is confusing. For example, the variable 'k' is used to denote the kernel function in Section 2, but in Section 3 it is also used to define p in Theorem 3 and as a subscript in Theorem 4. It would be better to use different symbols in different places.
In the experiments on the SUN397 dataset, the authors are encouraged to compare with other MTL methods instead of just the MTL-SDCA method.
Q2: Please summarize your review in 1-2 sentences
This paper presents a new optimization method for some existing works [7,14] when using some appropriate regularizers for a task covariance matrix.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all reviewers for their feedback. We highlight our main contributions:
- Using the theory of positive definite kernels, we show that the constraint that the output kernel is positive semidefinite (psd) can be dropped in our multi-task learning formulation, and we derive an unconstrained dual optimization problem (line 212) which we solve efficiently via dual coordinate ascent. We obtain an average speedup of 7-25 over OKL, depending on the dataset. We characterize the class of regularizers for which dropping the psd constraint is possible (Theorem 5). This is a highly original contribution and could also lead to more efficient algorithms in other domains.
- We show that the p-norm regularizer for the output kernel with small p leads to better prediction performance (Tables 3+4) and also better interpretability of the output kernel (Figure 1), as spurious correlations between tasks are eliminated.
Detailed Comments:
References [R1-R5] below are as used by Reviewer 1. The rest of the references are as in the paper.
Reviewer 1
- "The contribution is limited as ... previous work [R1] and [R2] does not need the positive-semidefiniteness constraint"
[R1,R2,R5] indeed have closed-form expressions for the output kernels in their alternating schemes for fixed weights. However, computing these expressions requires a costly eigenvector decomposition (EVD) (cost O(T^3)). [R1] needs to solve a Sylvester equation, which is solved most efficiently using an EVD (the authors of [R1] implement it this way in their code). In contrast, we show that the dual optimization problem is unconstrained and that the output kernel can be obtained by applying a function *elementwise* to all entries of a certain matrix, which requires *no* EVD and is automatically positive semidefinite. Via the primal-dual gap we then have a well-defined stopping criterion, in contrast to the alternating schemes.
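As a toy illustration of why an elementwise construction needs no EVD (a sketch only, with made-up names; the integer elementwise power used here need not match the paper's exact map, but PSD-ness of elementwise integer powers follows from the Schur product theorem):

```python
import numpy as np

def output_kernel_elementwise(K, A, k=3, c=1.0):
    # M = A^T K A is PSD whenever K is PSD; the elementwise integer
    # power M ** k is again PSD by the Schur product theorem, so no
    # eigendecomposition (EVD) is ever needed.
    M = A.T @ K @ A          # (T, T) Gram-like matrix over tasks
    return c * M ** k        # elementwise power, O(T^2) given M

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
K = X @ X.T                       # PSD input kernel on 20 points
A = rng.standard_normal((20, 4))  # dual variables, T = 4 tasks
Theta = output_kernel_elementwise(K, A)
# Theta is symmetric PSD by construction, with no EVD performed.
```

This contrasts with an alternating scheme, which would project onto the PSD cone (or solve a Sylvester equation) via an O(T^3) eigendecomposition at every outer iteration.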
 "the
joint convexity of the output kernel learning problem is not
new"
We are nowhere claiming novelty. The main step for the joint
convexity is Lemma 1 for which we cite [2, R2]. In 114119 we discuss
similarities resp. differences to existing work. However, the complete
derivation in this explicit form is not contained anywhere. While [R5] is
similar (this paper appeared just a few days before the NIPS deadline), we
think that our derivation avoiding vectorvalued RKHS is easier to
understand.
 "why is the proposed approach compared to OKL [R1] in
terms of running time but not in performance, performance needs to
compared to [R4,R5]"
For squared loss and Frobenius norm
regularizer on Theta, OKL is the same as our approach denoted FMTL_2S
(p=2 and S for squared loss). As the problem is convex, our and their
solver yield the same result. We got the code from the authors of [R5].
While for the OKL/FMTL_2S formulation, our FMTL_2S is 4.3 times faster
than [R5] with the best hyperparameter setting and on avg 7 times faster
on all hyperparameter values (on SUN dataset). An updated Fig.2 with
timings of [R5] is available
at: http://postimg.org/image/svdw70mcl/ Regarding prediction we
compare with all the formulations studied in [R5], which are: CMTL [9],OKL
[R1],MTRL [R2] and [2]. [2] is a special case of GMTL[8] with 1 group and
we have included this case in our experiments. The approach of [R4]
learns an input and output kernel. We compare prediction performance with
other formulations that learn input and output kernels (MTFL[23], GMTL[8],
see line 314/315). Moreover, the authors of [R4] told us that code for
their method is not available.
Reviewer 3
- "The proposed formulation should have a separate regularization mechanism for every task. This is known to improve predictive performance."
In principle, we agree. We implicitly have a separate regularization for every task via the output kernel (Eq. (4) and line 144). If the output kernel is a diagonal matrix, then the inverse diagonal entries correspond to task-specific regularization parameters. Thus we implicitly tune task-specific regularization parameters when learning the output kernel. Implicit regularization has been used in previous multi-task approaches [14,23]. We also compare to approaches which explicitly regularize every task [R2,9,13] and outperform them.
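The diagonal case can be made explicit. In generic multi-task notation (a sketch, not the paper's Eq. (4)): for an output kernel $\Theta = \mathrm{diag}(\theta_1,\dots,\theta_T)$ with $\theta_t > 0$ and task weight vectors $w_t$, the joint regularizer decouples across tasks:

```latex
\sum_{s,t=1}^{T} \bigl(\Theta^{-1}\bigr)_{st}\, \langle w_s, w_t \rangle
  \;=\; \sum_{t=1}^{T} \frac{1}{\theta_t}\, \lVert w_t \rVert^2 ,
```

so each learned diagonal entry $\theta_t$ plays the role of an inverse per-task regularization parameter, tuned automatically rather than by cross-validation.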
Reviewer 4
- "for the SUN397 dataset, the authors should compare with other MTL methods instead of just MTL-SDCA"
MTL-SDCA was chosen as it has state-of-the-art results on the SUN397 dataset and a scalable multi-class implementation. In the meantime we have efficiently implemented MTL [13]. Accuracies are: 42.0 (m=5), 47.4 (m=10), 51.7 (m=20), 57.0 (m=50) for SUN (we are better). The implementation of MTRL [R2] is not efficient for the multi-class setting, which leads to memory problems on SUN. We are implementing a memory-optimized version of MTRL (for multi-class) and will report its results in the final version. Please note that for USPS/MNIST we compare to GMTL and MTRL.

