{"title": "Adaptive Smoothed Online Multi-Task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4296, "page_last": 4304, "abstract": "This paper addresses the challenge of jointly learning both the per-task model parameters and the inter-task relationships in a multi-task online learning setting. The proposed algorithm features probabilistic interpretation, efficient updating rules and flexible modulation on whether learners focus on their specific task or on jointly address all tasks. The paper also proves a sub-linear regret bound as compared to the best linear predictor in hindsight. Experiments over three multi-task learning benchmark datasets show advantageous performance of the proposed approach over several state-of-the-art online multi-task learning baselines.", "full_text": "Adaptive Smoothed Online Multi-Task Learning\n\nKeerthiram Murugesan\u2217\nCarnegie Mellon University\nkmuruges@cs.cmu.edu\n\nHanxiao Liu\u2217\n\nCarnegie Mellon University\nhanxiaol@cs.cmu.edu\n\nJaime Carbonell\n\nCarnegie Mellon University\n\njgc@cs.cmu.edu\n\nYiming Yang\n\nCarnegie Mellon University\n\nyiming@cs.cmu.edu\n\nAbstract\n\nThis paper addresses the challenge of jointly learning both the per-task model\nparameters and the inter-task relationships in a multi-task online learning setting.\nThe proposed algorithm features probabilistic interpretation, ef\ufb01cient updating\nrules and \ufb02exible modulation on whether learners focus on their speci\ufb01c task or\non jointly address all tasks. The paper also proves a sub-linear regret bound as\ncompared to the best linear predictor in hindsight. 
Experiments over three multi-task learning benchmark datasets show advantageous performance of the proposed approach over several state-of-the-art online multi-task learning baselines.

1 Introduction

The power of joint learning across multiple tasks arises from the transfer of relevant knowledge across said tasks, especially from information-rich tasks to information-poor ones. Instead of learning individual models, multi-task methods leverage the relationships between tasks to jointly build a better model for each task. Most existing work in multi-task learning focuses on how to take advantage of these task relationships, either to share data directly [1] or to learn model parameters via cross-task regularization techniques [2, 3, 4]. In a broad sense, there are two settings for learning these task relationships: 1) batch learning, in which an entire training set is available to the learner, and 2) online learning, in which the learner sees the data in a sequential fashion. In recent years, online multi-task learning has attracted extensive research attention [5, 6, 7, 8, 9].
Following the online setting, particularly that of [6, 7], at each round t the learner receives a set of K observations from K tasks and predicts the output label for each of these observations. Subsequently, the learner receives the true labels and updates the model(s) as necessary. This sequence is repeated over the entire dataset, simulating a data stream. Our approach follows an error-driven update rule in which the model for a given task is updated only when the prediction for that task is in error. The goal of an online learner is to minimize errors compared to the full hindsight learner. The key challenge in online learning with a large number of tasks is to adaptively learn the model parameters and the task relationships, which potentially change over time. 
Without efficient, manageable updates at each round, learning the task relationship matrix automatically may impose a severe computational burden. In other words, we need to make predictions and update the models in an efficient, real-time manner.
We propose an online learning framework that efficiently learns multiple related tasks by estimating the task relationship matrix from the data, along with the model parameters for each task. We learn the model for each task by sharing data from related tasks directly. Our model provides a natural way to specify the trade-off between learning the hypothesis from each task's own (possibly quite limited) data and from the data of multiple related tasks. We propose an iterative algorithm that learns the task parameters and the task-relationship matrix alternately. We first describe our proposed approach in a batch setting and then extend it to the online learning paradigm. In addition, we provide a theoretical analysis for our online algorithm and show that it can achieve a sub-linear regret compared to the best linear predictor in hindsight. We evaluate our model against several state-of-the-art online multi-task learning algorithms.

*Both student authors contributed equally.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

There are many useful application areas for online multi-task learning, including optimizing financial trading, email prioritization, personalized news, and spam filtering. Consider the latter, where some spam is universal to all users (e.g. financial scams), some messages might be useful to certain affinity groups but spam to most others (e.g. announcements of meditation classes or other special-interest activities), and some may depend on evolving user interests. In spam filtering each user is a task, and shared interests and dis-interests form the inter-task relationship matrix. 
If we can learn the matrix as well as improve the models from specific spam/not-spam decisions, we can perform mass customization of spam filtering, borrowing spam/not-spam feedback from users with similar preferences. The primary contribution of this paper is precisely the joint learning of inter-task relationships and its use in estimating per-task model parameters in an online setting.

1.1 Related Work

While there is considerable literature on online multi-task learning, many crucial aspects remain largely unexplored. Most existing work in online multi-task learning focuses on how to take advantage of task relationships. To achieve this, Lugosi et al. [7] imposed a hard constraint on the K simultaneous actions taken by the learner in the expert setting, Agarwal et al. [10] used matrix regularization, and Dekel et al. [6] proposed a global loss function, as an absolute norm, to tie together the loss values of the individual tasks. Unlike existing online multi-task learning models, our paper proposes an intuitive and efficient way to learn the task relationship matrix automatically from the data, and to explicitly take the learned relationships into account during model updates.
Cavallanti et al. [8] assume that task relationships are available a priori. Kshirsagar et al. [11] do the same, but in a more adaptive manner. However, such task-relation prior knowledge is either unavailable or infeasible to obtain for many applications, especially when the number of tasks K is large [12] and/or when the manual annotation of task relationships is expensive [13]. Saha et al. [9] formulated the learning of the task relationship matrix as a Bregman-divergence minimization problem w.r.t. positive definite matrices. Their model suffers from high computational complexity, as semi-definite programming is required when updating the task relationship matrix at each online round. 
We show that with a different formulation, we can obtain a similar but much cheaper updating rule for learning the inter-task weights.
The work most closely related to ours is the Shared Hypothesis model (SHAMO) of Crammer and Mansour [1], where the key idea is to use a K-means-like procedure that simultaneously clusters different tasks and learns a small pool of m ≪ K shared hypotheses. Specifically, each task is free to choose the hypothesis from the pool that best classifies its own data, and each hypothesis is learned by pooling together all the training data belonging to the same cluster. A similar idea was explored by Abernethy et al. [5] in the expert setting.

2 Smoothed Multi-Task Learning

2.1 Setup

Suppose we are given K tasks, where the jth task is associated with N_j training examples. For brevity we consider a binary classification problem for each task, but the methods generalize to multi-class settings and are also applicable to regression tasks. We denote by [N] the consecutive integers from 1 to N. Let {(x_j^(i), y_j^(i))}_{i=1}^{N_j} be the training set and L_j(w) = (1/N_j) Σ_{i∈[N_j]} (1 − y_j^(i) ⟨x_j^(i), w⟩)_+ the batch empirical (hinge) loss for task j, where (z)_+ = max(0, z), x_j^(i) ∈ R^d is the ith instance from the jth task and y_j^(i) is its corresponding true label.
We start from the motivation of our formulation in Section 2.2, based on which we first propose a batch formulation in Section 2.3. Then, we extend the method to the online setting in Section 2.4.

2.2 Motivation

Learning tasks may be addressed independently via w_k* = argmin_{w_k} L_k(w_k), ∀k ∈ [K]. However, when each task has limited training data, it is often beneficial to allow information sharing among the tasks, which can be achieved via the following optimization:

    w_k* = argmin_{w_k} Σ_{j∈[K]} η_kj L_j(w_k)   ∀k ∈ [K]    (1)

Beyond task k itself, optimization (1) encourages the hypothesis w_k* to do well on the remaining K − 1 tasks, thus allowing tasks to borrow information from each other. In the extreme case where the K tasks have an identical data distribution, optimization (1) amounts to using Σ_{j∈[K]} N_j examples for training, as compared to N_k in independent learning.
The weight matrix η is in essence a task relationship matrix, and a prior may be manually specified according to domain knowledge about the tasks. For instance, η_kj would typically be set to a large value if tasks k and j are of a similar nature. If η = I, (1) reduces to learning the tasks independently. Clearly, manual specification of η is feasible only when K is small. Moreover, tasks may be statistically correlated even if a domain expert is unavailable to identify an explicit relation, or if the effort required to do so is too great. Hence, it is often desirable to automatically estimate the optimal η adapted to the inter-task problem structure.
We propose to learn η in a data-driven manner. For the kth task, we optimize

    w_k*, η_k* = argmin_{w_k, η_k ∈ Θ} Σ_{j∈[K]} η_kj L_j(w_k) + λ r(η_k)    (2)

where Θ defines the feasible domain of η_k and the regularizer r prevents degenerate cases, e.g., where η_k becomes an all-zero vector. Optimization (2) shares the same underlying insight as Self-Paced Learning (SPL) [14, 15], where the algorithm automatically learns the weights over data points during training. 
However, the process and scope of the two methods differ fundamentally: SPL minimizes the weighted loss over data points within a single domain, while optimization (2) minimizes the weighted loss over multiple tasks across possibly heterogeneous domains.
A common choice of Θ and r(η_k) in SPL is Θ = [0, 1]^K and r(η_k) = −‖η_k‖_1. There are several drawbacks to naively applying this type of setting to the multi-task scenario: (i) Lack of focus: there is no guarantee that the kth learner will put more focus on the kth task itself. When task k is intrinsically difficult, η_kk* could simply be set near zero and w_k* becomes almost independent of the kth task. (ii) Weak interpretability: the learned η_k* may not be interpretable, as it is not directly tied to any physical meaning. (iii) Lack of a worst-case guarantee in the online setting. All of these issues are addressed by our proposed model in the following.

2.3 Batch Formulation

We parametrize the aforementioned task relationship matrix η ∈ R^{K×K} as follows:

    η = α I_K + (1 − α) P    (3)

where I_K ∈ R^{K×K} is the identity matrix, P ∈ R^{K×K} is a row-stochastic matrix and α is a scalar in [0, 1]. The task relationship matrix η defined as above has the following interpretations:

1. Concentration Factor: α quantifies the learners' "concentration" on their own tasks. Setting α = 1 amounts to independent learning. We will see from the forthcoming Theorem 1 how to specify α to ensure the optimality of the online regret bound.
2. Smoothed Attention Matrix: P quantifies the degree to which the learners are attentive to all tasks. Specifically, define the kth row of P, namely p_k ∈ Δ^{K−1}, as a probability distribution over all tasks, where Δ^{K−1} denotes the probability simplex. Our goal of learning a data-adaptive η now becomes learning a data-adaptive attention matrix P.

Common choices of η in several existing algorithms are special cases of (3). For instance, domain adaptation assumes α = 0 and a fixed row-stochastic matrix P; in multi-task learning, we obtain the effective heuristics for specifying η of Cavallanti et al. [8] when α = 1/(1+K) and P = (1/K) 1 1^T. When there are m ≪ K unique distributions p_k, the problem reduces to the SHAMO model [1].
Equation (3) implies that the task relationship matrix η is also row-stochastic, and that we always reserve probability α for the kth task itself, as η_kk ≥ α. For each learner, the presence of α entails a trade-off between learning from other tasks and concentrating on its own task. Note that we do not require P to be symmetric, due to the asymmetric nature of information transferability: while classifiers trained on a resource-rich task can be transferred well to a resource-scarce task, the reverse is not usually true. Motivated by the above discussion, our batch formulation instantiates (2) as follows:

    w_k*, p_k* = argmin_{w_k, p_k ∈ Δ^{K−1}} Σ_{j∈[K]} η_kj(p_k) L_j(w_k) − λ H(p_k)    (4)
               = argmin_{w_k, p_k ∈ Δ^{K−1}} E_{j∼Multinomial(η_k(p_k))} L_j(w_k) − λ H(p_k)    (5)

where H(p_k) = −Σ_{j∈[K]} p_kj log p_kj denotes the entropy of the distribution p_k. Optimization (4) can be viewed as balancing between minimizing the cross-task loss with mixture weights η_k and maximizing the smoothness of the cross-task attention. The max-entropy regularization favours uniform attention over all tasks and leads to analytical updating rules for p_k (and η_k).
Optimization (4) is biconvex over w_k and p_k. With p_k^(t) fixed, the solution for w_k can be obtained using off-the-shelf solvers. 
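For concreteness, this w-step (minimizing the η-weighted sum of task hinge losses with the attention fixed) can be carried out with plain subgradient descent; the code below is our own illustrative sketch with invented names, not the authors' implementation:

```python
import numpy as np

def w_step(tasks, eta_k, n_iters=200, lr=0.1):
    """Sketch of the w-step: minimize sum_j eta_k[j] * L_j(w), where
    L_j is task j's average hinge loss (see the Setup section).

    tasks : list of (X_j, y_j) with X_j of shape (N_j, d), y_j in {-1, +1}
    eta_k : length-K weight vector (the kth row of the relationship matrix)
    """
    d = tasks[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = np.zeros(d)
        for eta_kj, (X, y) in zip(eta_k, tasks):
            margins = y * (X @ w)
            active = margins < 1  # examples with positive hinge loss
            # subgradient of (1/N_j) * sum_i max(0, 1 - y_i <x_i, w>)
            grad -= eta_kj * (y[active, None] * X[active]).sum(axis=0) / len(y)
        w -= lr * grad
    return w
```

Any off-the-shelf convex solver would do equally well; subgradient descent simply keeps the sketch self-contained.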
With w_k^(t) fixed, the solution for p_k is given in closed form:

    p_kj^(t+1) = e^{−(1−α) L_j(w_k^(t)) / λ} / Σ_{j'=1}^K e^{−(1−α) L_{j'}(w_k^(t)) / λ}   ∀j ∈ [K]    (6)

The exponential updating rule in (6) has an intuitive interpretation: our algorithm attempts to use the hypothesis w_k^(t) obtained from the kth task to classify the training examples in all other tasks. Task j is treated as related to task k if its training examples can be well classified by w_k. The intuition is that two tasks are likely to be related to each other if they share similar decision boundaries; combining their associated data should thus yield a stronger model, trained over more data.

2.4 Online Formulation

In this section, we extend our batch formulation to the online setting. We assume that all tasks are performed at each round, though the assumption can be relaxed with some added complexity to the method. At time t, the kth task receives a training instance x_k^(t), makes a prediction ⟨x_k^(t), w_k^(t)⟩ and suffers a loss after y_k^(t) is revealed. Our algorithm follows an error-driven update rule in which the model is updated only when a task makes a mistake.
Let ℓ_kj^(t)(w) = 1 − y_j^(t)⟨x_j^(t), w⟩ if y_j^(t)⟨x_j^(t), w⟩ < 1, and ℓ_kj^(t)(w) = 0 otherwise. For brevity, we introduce the shorthands ℓ_kj^(t) = ℓ_kj^(t)(w_k^(t)) and η_kj^(t) = η_kj(p_k^(t)).
For the kth task we consider the following optimization problem at each time step:

    w_k^(t+1), p_k^(t+1) = argmin_{w_k, p_k ∈ Δ^{K−1}} C Σ_{j∈[K]} η_kj(p_k) ℓ_kj^(t)(w_k) + ‖w_k − w_k^(t)‖² + λ D_KL(p_k ‖ p_k^(t))    (7)

where Σ_{j∈[K]} η_kj(p_k) ℓ_kj^(t)(w_k) = E_{j∼Multi(η_k(p_k))} ℓ_kj^(t)(w_k), and D_KL(p_k ‖ p_k^(t)) denotes the Kullback–Leibler (KL) divergence between the current and previous soft-attention distributions. The presence of the last two terms in (7) allows the model parameters to evolve smoothly over time. Optimization (7) is naturally analogous to the batch optimization (4): the batch loss L_j(w_k) is replaced by its noisy version ℓ_kj^(t)(w_k) at time t, and the negative entropy −H(p_k) = Σ_j p_kj log p_kj is replaced by D_KL(p_k ‖ p_k^(t)), also known as the relative entropy. 
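Both subproblems of (7) admit closed-form solutions, as derived next. As a concrete preview, one online round for task k can be sketched as below; this is our own illustrative code with invented names, and it omits the error-driven gating (updating only on a mistake) used by the actual algorithm:

```python
import numpy as np

def osmtl_round(w_k, p_k, X_t, y_t, k, C=1.0, lam=1.0, alpha=0.5):
    """One (ungated) online round for task k given this round's instances
    X_t (one row per task) and labels y_t in {-1, +1}.

    w-update: closed-form proximal step on the eta-weighted hinge losses.
    p-update: multiplicative (mirror-descent) step, renormalized on the simplex.
    """
    K = len(p_k)
    eta_k = alpha * (np.arange(K) == k) + (1 - alpha) * p_k  # eq. (3), row k
    margins = y_t * (X_t @ w_k)
    losses = np.maximum(0.0, 1.0 - margins)  # instantaneous hinge losses
    active = margins < 1
    # move toward the violated examples of all tasks, weighted by eta
    w_next = w_k + C * ((eta_k * active * y_t)[:, None] * X_t).sum(axis=0)
    # exponentiate the negatively scaled losses and renormalize
    p_next = p_k * np.exp(-C * (1 - alpha) * losses / lam)
    p_next /= p_next.sum()
    return w_next, p_next
```

Tasks whose example is misclassified by w_k this round lose attention multiplicatively, while w_k itself is pulled toward every violated example in proportion to its attention weight.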
We will show that the above formulation leads to analytical updating rules for both w_k and p_k, a desirable property, particularly for an online algorithm.
The solution for w_k^(t+1) conditioned on p_k^(t) is given in closed form by the proximal operator

    w_k^(t+1) = prox(w_k^(t)) = argmin_{w_k} C Σ_{j∈[K]} η_kj(p_k^(t)) ℓ_kj^(t)(w_k) + ‖w_k − w_k^(t)‖²    (8)
              = w_k^(t) + C Σ_{j: y_j^(t)⟨x_j^(t), w_k^(t)⟩ < 1} η_kj(p_k^(t)) y_j^(t) x_j^(t)    (9)

The solution for p_k^(t+1) conditioned on w_k^(t) is also given in closed form, analogous to mirror descent [16]:

    p_k^(t+1) = argmin_{p_k ∈ Δ^{K−1}} C(1 − α) Σ_{j∈[K]} p_kj ℓ_kj^(t) + λ D_KL(p_k ‖ p_k^(t))    (10)

    ⟹ p_kj^(t+1) = p_kj^(t) e^{−C(1−α) ℓ_kj^(t) / λ} / Σ_{j'} p_kj'^(t) e^{−C(1−α) ℓ_kj'^(t) / λ},   j ∈ [K]    (11)

The pseudo-code is given in Algorithm 2.² Our algorithm is "passive" in the sense that updates are carried out only when a classification error occurs, namely when ŷ_k^(t) ≠ y_k^(t). An alternative is to perform "aggressive" updates whenever the active set {j : y_j^(t)⟨x_j^(t), w_k^(t)⟩ < 1} is non-empty.

Algorithm 1: Batch Algorithm (SMTL-e)
  while not converged do
    for k ∈ [K] do
      w_k^(t) ← argmin_{w_k} α L_k(w_k) + (1 − α) Σ_{j∈[K]} p_kj^(t) L_j(w_k);
      for j ∈ [K] do
        p_kj^(t+1) ← e^{−(1−α) L_j(w_k^(t)) / λ} / Σ_{j'=1}^K e^{−(1−α) L_{j'}(w_k^(t)) / λ};
      end
    end
    t ← t + 1;
  end

Algorithm 2: Online Algorithm (OSMTL-e)
  for t ∈ [T] do
    for k ∈ [K] do
      if y_k^(t)⟨x_k^(t), w_k^(t)⟩ < 1 then
        w_k^(t+1) ← w_k^(t) + C α 1_{ℓ_kk^(t) > 0} y_k^(t) x_k^(t) + C(1 − α) Σ_{j: ℓ_kj^(t) > 0} p_kj^(t) y_j^(t) x_j^(t);
        for j ∈ [K] do
          p_kj^(t+1) ← p_kj^(t) e^{−C(1−α) ℓ_kj^(t) / λ} / Σ_{j'=1}^K p_kj'^(t) e^{−C(1−α) ℓ_kj'^(t) / λ};
        end
      else
        w_k^(t+1), p_k^(t+1) ← w_k^(t), p_k^(t);
      end
    end
  end

2.5 Regret Bound

Theorem 1. ∀k ∈ [K], let S_k = {(x_k^(t), y_k^(t))}_{t=1}^T be a sequence of T examples for the kth task, where x_k^(t) ∈ R^d, y_k^(t) ∈ {−1, +1} and ‖x_k^(t)‖² ≤ R, ∀t ∈ [T]. 
Let C be a positive constant and let α be some predefined parameter in [0, 1]. Let {w_k*}_{k∈[K]} be any arbitrary vectors in R^d, and let the hinge losses of w_k* on the examples (x_k^(t), y_k^(t)) and (x_j^(t), y_j^(t)), j ≠ k, be given by ℓ_kk^(t)* = (1 − y_k^(t)⟨x_k^(t), w_k*⟩)_+ and ℓ_kj^(t)* = (1 − y_j^(t)⟨x_j^(t), w_k*⟩)_+, respectively. If {S_k}_{k∈[K]} is presented to the OSMTL algorithm, then ∀k ∈ [K] we have

    Σ_{t∈[T]} (ℓ_kk^(t) − ℓ_kk^(t)*) ≤ (1/(2Cα)) ‖w_k*‖² + ((1−α)T/α) max_{t∈[T]} (ℓ_kk^(t)* + max_{j∈[K], j≠k} ℓ_kj^(t)*) + (CR²T)/(2α)    (12)

Notice that when α → 1, the above reduces to the perceptron mistake bound [17].

² It is recommended to set α ∝ √T/(1+√T) and C ∝ (1+√T)/T, as suggested by Corollary 2.

Corollary 2. Let α = √T/(1+√T) and C = (1+√T)/T in Theorem 1. Then we have

    Σ_{t∈[T]} (ℓ_kk^(t) − ℓ_kk^(t)*) ≤ √T ( (1/2) ‖w_k*‖² + max_{t∈[T]} (ℓ_kk^(t)* + max_{j∈[K], j≠k} ℓ_kj^(t)*) + 2R² )    (13)

Proofs are given in the supplementary material. Theorem 1 and Corollary 2 have several implications:

1. The quality of the bound depends on both ℓ_kk^(t)* and the maximum of {ℓ_kj^(t)*}_{j∈[K], j≠k}. In other words, the worst-case regret will be lower if the kth true hypothesis w_k* can well distinguish the training examples in the kth task itself as well as those in all the other tasks.

2. 
Corollary 2 indicates that the difference between the cumulative loss achieved by our algorithm and by any fixed hypothesis for task k is bounded by a term growing sub-linearly in T.

3. Corollary 2 provides a principled way to set the hyperparameters to achieve the sub-linear regret bound. Specifically, recall that α quantifies the self-concentration of each task. Therefore, α = √T/(1+√T) → 1 as T → ∞ implies that for a large horizon it is less necessary to rely on other tasks, as the available supervision for the task itself is already plentiful; C = (1+√T)/T → 0 as T → ∞ suggests a diminishing learning rate over the horizon length.

3 Experiments

We evaluate the performance of our algorithm under batch and online settings. All reported results in this section are averaged over 30 random runs or permutations of the training data. Unless otherwise specified, all model parameters are chosen via 5-fold cross-validation.

3.1 Benchmark Datasets

We use three datasets for our experiments. Details are given below.
Landmine Detection³ consists of 19 tasks collected from different landmine fields. Each task is a binary classification problem: landmines (+) or clutter (−), and each example consists of 9 features extracted from radar images: four moment-based features, three correlation-based features, one energy-ratio feature and a spatial-variance feature. The landmine data is collected from two different terrains: tasks 1-10 are from highly foliated regions and tasks 11-19 are from desert regions, so the tasks naturally form two clusters. Any hypothesis learned from a task should be able to utilize the information available from other tasks belonging to the same cluster.
Spam Detection⁴ We use the dataset from the ECML PAKDD 2006 Discovery Challenge for the spam detection task. 
We used the Task B challenge dataset, which consists of labeled training data from the inboxes of 15 users. We consider each user as a single task, and the goal is to build a personalized spam filter for each user. Each task is a binary classification problem: spam (+) or non-spam (−), and each example consists of approximately 150K features representing term frequencies of the word occurrences. Some spam is universal to all users (e.g. financial scams), while some messages might be useful to certain affinity groups but spam to most others. Such adaptive behavior of users' interests and dis-interests can be modeled efficiently by utilizing the data from other users to learn the per-user model parameters.
Sentiment Analysis⁵ We evaluated our algorithm on product reviews from Amazon. The dataset contains product reviews from 24 domains. We consider each domain as a binary classification task. Reviews with rating > 3 were labeled positive (+), those with rating < 3 were labeled negative (−), and reviews with rating = 3 were discarded, as their sentiments were ambiguous and hard to predict. Similar to the previous dataset, each example consists of approximately 350K features representing term frequencies of the word occurrences.
We choose 3040 examples (160 training examples per task) for landmine, 1500 emails for spam (100 emails per user inbox) and 2400 reviews for sentiment (100 reviews per domain) for our experiments.

³ http://www.ee.duke.edu/~lcarin/LandmineData.zip
⁴ http://ecmlpkdd2006.org/challenge.html
⁵ http://www.cs.jhu.edu/~mdredze/datasets/sentiment

Figure 1: Average AUC calculated for the compared models (left). A visualization of the task relationship matrix on Landmine learned by SMTL-t (middle) and SMTL-e (right). 
The probabilistic formulation of SMTL-e allows it to discover more interesting patterns than SMTL-t.

Note that we intentionally kept the size of the training data small to drive the need for learning from other tasks, which diminishes as the training sets per task become large. Since all these datasets have a class-imbalance issue (with few (+) examples as compared to (−) examples), we use the average Area Under the ROC Curve (AUC) as the performance measure.

3.2 Batch Setting

Since the main focus of this paper is online learning, we briefly conduct an experiment on the landmine detection dataset for our batch learner to demonstrate the advantages of learning from shared data. We implement two versions of our proposed algorithm with different updates: SMTL-t (SMTL with thresholding updates), where p_kj^(t+1) ∝ (λ − ℓ_kj^(t))_+,⁶ and SMTL-e (SMTL with exponential updates) as in Algorithm 1. We compare our SMTL* with two standard baseline methods for the batch setting: Independent Task Learning (ITL), learning a single model for each task, and Single Task Learning (STL), learning a single classification model on the pooled data from all the tasks. In addition, we compare our models with SHAMO, which is closest in spirit to our proposed models. We select the values of λ and α for SMTL* and M for SHAMO using cross-validation.
Figure 1 (left) shows the average AUC calculated for different training sizes on landmine. We can see that the baseline results are similar to the ones reported by Xue et al. [3]. Our proposed algorithm (SMTL*) outperforms the other baselines, but when we have very few training examples (say 20 per task), the performance of STL improves as it has more examples than the others. Since η depends on the loss incurred on the data from related tasks, this loss-based measure can be unreliable for a small training sample size. 
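The two variants compared here differ only in how a task's loss is mapped to attention weights; a minimal side-by-side sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def attention_thresholding(losses, lam):
    """SMTL-t style: weight proportional to (lam - loss)_+ ;
    tasks whose loss exceeds lam get exactly zero attention."""
    w = np.maximum(0.0, lam - losses)
    s = w.sum()
    return w / s if s > 0 else np.full(len(losses), 1.0 / len(losses))

def attention_exponential(losses, lam, alpha=0.5):
    """SMTL-e style, as in eq. (6): softmax of negatively scaled losses;
    every task keeps a strictly positive attention weight."""
    z = np.exp(-(1 - alpha) * losses / lam)
    return z / z.sum()
```

For losses [0.1, 0.5, 2.0] with λ = 1, the thresholding rule zeroes out the third task entirely, while the exponential rule keeps all three weights positive; this full-support behavior is the probabilistic smoothing referred to above.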
To our surprise, SHAMO performs worse than the other models, which tells us that assuming two tasks are exactly the same (in the sense of their hypothesis) may be inappropriate in real-world applications. Figure 1 (middle & right) shows the task relationship matrix η for SMTL-t and SMTL-e on landmine when the number of training instances is 160 per task.

3.3 Online Setting

To evaluate the performance of our algorithm in the online setting, we use all three datasets (landmine, spam and sentiment) and compare our proposed methods to 5 baselines. We implemented two variations of the Passive-Aggressive algorithm (PA) [18]: PA-ITL learns an independent model for each task and PA-ONE builds a single model for all the tasks. We also implemented the algorithm proposed by Dekel et al. for online multi-task learning with a shared loss (OSGL) [6]. These three baselines exploit neither the task relationships nor the data from other tasks during model updates. Next, we implemented two online multi-task learning methods related to our approach: FOML, which initializes η with fixed weights [8], and Online Multi-Task Relationship Learning (OMTRL) [9], which learns a task covariance matrix along with the task parameters. We could not find a satisfactory way to implement an online version of the SHAMO algorithm, since the number of shared hypotheses or clusters varies over time.

⁶ Our algorithm and theorem can be easily generalized to other types of updating rules by replacing exp in (6) with other functions. 
In the latter cases, however, η may no longer have a probabilistic interpretation.

[Figure 1 appears here: AUC vs. training size (20 to 300) for STL, ITL, SHAMO, SMTL-t and SMTL-e (left), and the 19×19 task relationship matrices learned on landmine (middle and right).]

Table 1: Average performance on three datasets: means and standard errors over 30 random shuffles.

Model    | Landmine Detection                 | Spam Detection                      | Sentiment Analysis
         | AUC           nSV         Time (s) | AUC            nSV          Time (s) | AUC           nSV          Time (s)
PA-ONE   | 0.5473 (0.12) 2902.9 (4.21)   0.01 | 0.8739 (0.01)  1455.0 (4.64)    0.16 | 0.7193 (0.03) 2350.7 (6.36)    0.19
PA-ITL   | 0.5986 (0.04) 618.1 (27.31)   0.01 | 0.8350 (0.01)  1499.9 (0.37)    0.16 | 0.7364 (0.02) 2399.9 (0.25)    0.16
OSGL     | 0.6482 (0.03) 740.8 (42.03)   0.01 | 0.9551 (0.007) 1402.6 (13.57)   0.17 | 0.8375 (0.02) 2369.3 (14.63)   0.17
FOML     | 0.6322 (0.04) 426.5 (36.91)   0.11 | 0.9347 (0.009) 819.8 (18.57)    1.5  | 0.8472 (0.02) 1356.0 (78.49)   1.20
OMTRL    | 0.6409 (0.05) 432.2 (123.81)  6.9  | 0.9343 (0.008) 840.4 (22.67)    53.6 | 0.7831 (0.02) 1346.2 (85.99)   128
OSMTL-t  | 0.6776 (0.03) 333.6 (40.66)   0.18 | 0.9509 (0.007) 809.5 (19.35)    1.4  | 0.9354 (0.01) 1312.8 (79.15)   2.15
OSMTL-e  | 0.6404 (0.04) 458 (36.79)     0.19 | 0.9596 (0.006) 804.2 (19.05)    1.3  | 0.9465 (0.01) 1322.2 (80.27)   2.16

Table 1 summarizes the performance of all the above algorithms on the three datasets. In addition to the AUC scores, we report the average total number of support vectors (nSV) and the CPU time taken for learning from one instance (Time). From the table, it is evident that OSMTL* outperforms all the baselines in terms of both AUC and nSV. This is expected for the two default baselines (PA-ITL and PA-ONE). 
We believe that PA-ONE shows better results than PA-ITL on spam because the former learns the global information (common spam emails) that is quite dominant in the spam detection problem. The update rule for FOML is similar to ours but uses fixed weights. The results justify our claim that making the weights adaptive leads to improved performance.
In addition to better results, our algorithm consumes less or comparable CPU time than the baselines which take into account inter-task relationships. Compared to the OMTRL algorithm, which recomputes the task covariance matrix every iteration using expensive SVD routines, the adaptive weights in our method are updated independently for each task. As specified in [9], we learn the task weight vectors for OMTRL separately, as K independent perceptrons, on the first half of the available training data (EPOCH=0.5). OMTRL thus potentially loses half the data before learning the task-relationship matrix, as the matrix depends on the quality of the task weight vectors.
It is evident from the table that the algorithms which use loss-based updates of the weights η (OSGL, OSMTL*) considerably outperform the ones that do not (FOML, OMTRL). We believe that the loss incurred per instance gives the algorithm valuable information both to learn from that instance and to evaluate the inter-dependencies among tasks. Task relationship information does help by enabling learning from related tasks' data, but our experiments demonstrate that combining both the task relationships and the loss information yields a better algorithm.
We would like to note that our proposed algorithm OSMTL* does exceptionally well on sentiment, which has been used as a standard benchmark application for domain adaptation experiments in the existing literature [19]. 
We believe the advantageous results on the sentiment dataset imply that, even with relatively few examples, effective knowledge transfer among the tasks/domains can be achieved by adaptively choosing the (probabilistic) inter-task relationships from the data.

4 Conclusion

We proposed a novel online multi-task learning algorithm that jointly learns the per-task hypotheses and the inter-task relationships. The key idea is to smooth the loss function of each task w.r.t. a probability distribution over all tasks, and to adaptively refine this distribution over time. In addition to closed-form updating rules, we show that our method achieves a sub-linear regret bound. The effectiveness of our algorithm is empirically verified over several benchmark datasets.

Acknowledgments

This work is supported in part by NSF under grants IIS-1216282 and IIS-1546329.

References
[1] Koby Crammer and Yishay Mansour. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems, pages 1475–1483, 2012.
[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[3] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. The Journal of Machine Learning Research, 8:35–63, 2007.
[4] Yu Zhang and Dit-Yan Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.
[5] Jacob Abernethy, Peter Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. In Learning Theory, pages 484–498. Springer, 2007.
[6] Ofer Dekel, Philip M. Long, and Yoram Singer. Online learning of multiple tasks with a shared loss.
Journal of Machine Learning Research, 8(10):2233–2264, 2007.
[7] Gábor Lugosi, Omiros Papaspiliopoulos, and Gilles Stoltz. Online multi-task learning with hard constraints. arXiv preprint arXiv:0902.3526, 2009.
[8] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. The Journal of Machine Learning Research, 11:2901–2934, 2010.
[9] Avishek Saha, Piyush Rai, Suresh Venkatasubramanian, and Hal Daumé. Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics, pages 643–651, 2011.
[10] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-138, 2008.
[11] Meghana Kshirsagar, Jaime Carbonell, and Judith Klein-Seetharaman. Multisource transfer learning for host-pathogen protein interaction prediction in unlabeled tasks. In NIPS Workshop on Machine Learning for Computational Biology, 2013.
[12] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
[13] Meghana Kshirsagar, Jaime Carbonell, and Judith Klein-Seetharaman. Multitask learning for host–pathogen protein interactions. Bioinformatics, 29(13):i217–i226, 2013.
[14] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
[15] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity.
In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
[16] A. S. Nemirovsky, D. B. Yudin, and E. R. Dawson. Problem complexity and method efficiency in optimization. 1982.
[17] Shai Shalev-Shwartz and Yoram Singer. Online learning: Theory, algorithms, and applications. PhD dissertation, 2007.
[18] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585, 2006.
[19] John Blitzer, Mark Dredze, Fernando Pereira, et al. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447, 2007.